thisisphume / short_text_analyzer

The Short-Text Analyzer helps analyze open-ended survey responses, which are usually shorter than three sentences. The analysis includes topic modeling, sentiment analysis, and visualization. Topic modeling is done using pre-trained language representations, namely BERT, combined with a clustering algorithm.
Apache License 2.0

Replacing UMAP with an autoencoder for dimension reduction. #1

Closed: thisisphume closed this issue 3 years ago

thisisphume commented 3 years ago

A page in UMAP's official documentation shows that UMAP can be used for clustering: https://umap-learn.readthedocs.io/en/latest/clustering.html.

However, from the "Understanding UMAP" article (https://pair-code.github.io/understanding-umap/), we found that:

  1. Cluster sizes in a UMAP plot mean nothing, i.e. the size of clusters relative to each other is meaningless.
  2. UMAP doesn't preserve distances, even though it does better than t-SNE.

UMAP will be replaced by an autoencoder, since UMAP preserves neither distances nor density. This problem is quite severe because we use HDBSCAN and KMeans for the clustering task: HDBSCAN is a hierarchical density-based algorithm, and KMeans is a distance-based algorithm.
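
One way to make the distance concern concrete is to compare pairwise distances before and after a reduction. The snippet below is a minimal sketch, not code from this repository: the function names and the toy points are hypothetical, and Pearson correlation of pairwise distances is just one simple proxy for "distance preservation".

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def distance_preservation(original, reduced):
    """Correlation between pairwise distances before and after reduction.

    Values near 1 mean the reduced space keeps the relative distances
    that a distance-based algorithm such as KMeans relies on.
    """
    return pearson(pairwise_distances(original), pairwise_distances(reduced))


# Toy data (hypothetical): four 3-D points whose third axis is almost constant.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 3.0, 0.1)]

# A reduction that simply drops the near-constant axis keeps distances intact.
faithful = [p[:2] for p in original]

# An arbitrary 2-D layout that ignores the original geometry.
distorted = [(0.0, 0.0), (5.0, 0.0), (0.0, 0.1), (0.2, 0.2)]

faithful_score = distance_preservation(original, faithful)
distorted_score = distance_preservation(original, distorted)
```

A score like this could be computed on a sample of the real embeddings to compare candidate reducers before handing their output to KMeans or HDBSCAN.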

thisisphume commented 3 years ago

Replacing UMAP with PCA or an autoencoder via thisisphume/embedding_tool.
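
For illustration, the PCA option can be sketched with NumPy alone. This is an assumption-laden sketch, not the implementation in thisisphume/embedding_tool: `pca_reduce` is a hypothetical helper, and the random matrix merely stands in for real BERT sentence embeddings.

```python
import numpy as np


def pca_reduce(embeddings, n_components):
    """Project rows onto their top principal components (hypothetical helper)."""
    X = np.asarray(embeddings, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))  # stand-in for sentence embeddings
reduced = pca_reduce(embeddings, n_components=5)

# Distance between the first two points before and after the projection.
dist_before = np.linalg.norm(embeddings[0] - embeddings[1])
dist_after = np.linalg.norm(reduced[0] - reduced[1])
```

Because PCA is an orthogonal linear projection, it can shrink a Euclidean distance but never enlarge it, which makes its output easier to reason about for distance-based clustering. The `reduced` array could then be passed to KMeans or HDBSCAN in place of the UMAP output.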