The Short-Text Analyzer helps analyze open-ended survey responses, which are usually shorter than three sentences. The analysis includes topic modeling, sentiment analysis, and visualization. Topic modeling is done using pre-trained language representations, namely BERT, combined with a clustering algorithm.
Apache License 2.0
Replacing UMAP with an autoencoder for dimension reduction #1
Cluster sizes in a UMAP projection mean nothing, i.e. the size of clusters relative to each other is meaningless.
UMAP does not preserve distances, even though it does better than t-SNE in this respect.
UMAP will be replaced by an autoencoder because UMAP preserves neither distances nor density. This is a severe problem because we use HDBSCAN and KMeans for the clustering task: HDBSCAN is a hierarchical density-based algorithm and KMeans is a distance-based algorithm.
One page in UMAP's official documentation shows that UMAP can be used for clustering (https://umap-learn.readthedocs.io/en/latest/clustering.html).
However, from the "Understanding UMAP" article (https://pair-code.github.io/understanding-umap/), we found that relative cluster sizes in a UMAP projection are meaningless and that distances are not preserved.
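The proposed replacement could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: the single tanh hidden layer, the hyperparameters, the function name `train_autoencoder`, and the synthetic matrix standing in for BERT sentence embeddings are all assumptions made for the example, and it uses only NumPy so that it runs without a deep-learning framework.

```python
import numpy as np


def train_autoencoder(X, latent_dim=3, epochs=300, lr=0.05, seed=0):
    """Fit a one-hidden-layer autoencoder by full-batch gradient descent.

    Encoder: Z = tanh(X @ W_enc + b_enc). Decoder: X_hat = Z @ W_dec + b_dec.
    Returns the parameters and the mean squared error recorded at each epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, size=(d, latent_dim))
    b_enc = np.zeros(latent_dim)
    W_dec = rng.normal(0.0, 0.1, size=(latent_dim, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Forward pass: encode, decode, measure the reconstruction error.
        Z = np.tanh(X @ W_enc + b_enc)
        err = Z @ W_dec + b_dec - X
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the mean squared error.
        g_out = 2.0 * err / (n * d)
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)  # back through tanh
        g_W_dec = Z.T @ g_out
        g_W_enc = X.T @ g_pre
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_pre.sum(axis=0)
    return (W_enc, b_enc, W_dec, b_dec), losses


def encode(X, params):
    """Project X into the latent space learned by the autoencoder."""
    W_enc, b_enc, _, _ = params
    return np.tanh(X @ W_enc + b_enc)


# Synthetic stand-in for sentence embeddings (assumption): 200 rows, 16 features
# generated from 3 underlying factors, then standardized per feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 16))
X = (X - X.mean(axis=0)) / X.std(axis=0)

params, losses = train_autoencoder(X)
Z = encode(X, params)  # reduced representation that would feed the clustering step
```

In the pipeline described above, `Z` would take the place of the UMAP output as the input to HDBSCAN or KMeans. Whether the latent space keeps distances and density well enough for those algorithms would still need to be checked on real embeddings.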