Open djt61 opened 6 years ago
The short answer is that t-SNE preserves local structure, but doesn't necessarily preserve global structure or density structure. Specifically, it tends to draw points into well-defined clumps even if they are outliers. This follows from the variable-width Gaussians it derives from the perplexity: sparse points get wider Gaussians, so they connect with the closest denser area and get pulled in. You can see that result in the plots: t-SNE has far less noise (clusters tend to lose few points before splitting apart). This need not be a bad thing, but it does mean you need to verify your clusters carefully to be sure the results are not artifacts of t-SNE. This Stack Overflow post has some useful discussion of potential issues with using t-SNE as a preprocessing step for clustering. It is not that it can't work; rather, a bad perplexity value can easily introduce artifacts, so you do need some way to validate the results (by hand if necessary).
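To make the "sparse points get wider Gaussians" remark concrete, here is a minimal NumPy sketch of the perplexity calibration step. It is a simplified illustration, not the code of any particular t-SNE implementation; the function name, the synthetic blob and the outlier position are all assumptions made for the example.

```python
import numpy as np

def sigma_for_perplexity(sq_dists, perplexity, tol=1e-5, max_iter=200):
    """Bisect (in log space) for the Gaussian width whose neighbour
    distribution has the requested perplexity."""
    target_entropy = np.log(perplexity)
    lo, hi = 1e-10, 1e10
    shifted = sq_dists - sq_dists.min()  # shift for numerical stability
    sigma = 1.0
    for _ in range(max_iter):
        sigma = np.sqrt(lo * hi)  # geometric midpoint of the bracket
        p = np.exp(-shifted / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:
            hi = sigma  # distribution too flat: narrow the Gaussian
        else:
            lo = sigma  # distribution too peaked: widen the Gaussian
    return sigma

rng = np.random.default_rng(0)
dense = rng.normal(size=(200, 2))   # hypothetical dense blob
outlier = np.array([[8.0, 8.0]])    # hypothetical isolated point
points = np.vstack([dense, outlier])

def sigma_of(index, perplexity=30.0):
    sq = np.sum((points - points[index]) ** 2, axis=1)
    sq = np.delete(sq, index)  # a point is not its own neighbour
    return sigma_for_perplexity(sq, perplexity)

central = int(np.argmin(np.sum(dense ** 2, axis=1)))  # blob point nearest the origin
sigma_dense = sigma_of(central)
sigma_outlier = sigma_of(len(points) - 1)
print(f"dense point sigma: {sigma_dense:.3f}, outlier sigma: {sigma_outlier:.3f}")
```

On this toy data the isolated point needs a much wider Gaussian to reach the same perplexity, which is the mechanism that links it to the nearest dense region.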
This is a great read on this topic:
Thank you for the answers and the links to the discussion post. The data we cluster is highly exploratory and derived from biological experiments that may contradict the existing biology literature, so there is no proper way to validate it, and we rely heavily on the accuracy of the clustering results. Thus it would be safe to say that using t-SNE as a preprocessing step for clustering is not advisable in this application.
Would you be able to suggest any algorithms for visualizing clustering results? I did read your paper on UMAP. I can do similar comparisons with UMAP or any other algorithm and get back to you.
I’ve been running some trials on a set of Doc2Vec-generated vectors (300 dimensions) and have noted that if I use t-SNE to reduce this to 3 dimensions the clustering works OK, but about 50-60% of the points are classified as noise. Given the concerns above about t-SNE, I tried using PCA and TSVD from sklearn, reducing only to 50 dimensions, but found that these techniques would always give me the error here
https://github.com/scikit-learn-contrib/hdbscan/issues/151
.... whatever the minimum cluster size or minimum sample size.
Does hdbscan only work with t-SNE?
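For readers comparing the two linear reductions mentioned here: scikit-learn's PCA centres the data before the decomposition, while its TruncatedSVD does not. The NumPy sketch below uses hand-rolled stand-ins for both (not the sklearn classes themselves) on synthetic data with a deliberately large mean, to show when the two agree.

```python
import numpy as np

def truncated_svd(X, k):
    """Project X onto its top-k right singular vectors (no centring)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def pca(X, k):
    """PCA expressed as truncated SVD of the column-centred data."""
    return truncated_svd(X - X.mean(axis=0), k)

def pairwise_distances(Z):
    """Dense Euclidean distance matrix; fine for small examples."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20)) + 5.0  # synthetic data with a large non-zero mean

d_pca = pairwise_distances(pca(X, 5))
d_tsvd = pairwise_distances(truncated_svd(X, 5))
d_tsvd_centred = pairwise_distances(truncated_svd(X - X.mean(axis=0), 5))

same_when_centred = np.allclose(d_pca, d_tsvd_centred)
same_on_raw = np.allclose(d_pca, d_tsvd)
print("PCA matches TSVD on centred data:", same_when_centred)
print("PCA matches TSVD on raw data:", same_on_raw)
```

Pairwise distances are compared rather than coordinates because a distance-based clusterer only sees distances, and singular vectors are only defined up to sign.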
It most certainly doesn't only work with t-SNE. What could be causing these issues is far less clear to me. I would expect PCA and TSVD to be essentially equivalent and quite safe for these purposes. It may be that something weird is happening in the PCA/TSVD that results in oddness in the resulting reduced dimension data (NaNs? multiple duplicated points? All zero vectors?).
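The kinds of oddness listed above (NaNs, duplicated points, all-zero vectors) can be counted in one pass. This is a minimal NumPy sketch; the helper name `degeneracy_report` is made up for illustration, and the infinity check is an extra not mentioned in the thread.

```python
import numpy as np

def degeneracy_report(coords):
    """Count common degeneracies in a 2-D array of reduced coordinates."""
    coords = np.asarray(coords, dtype=float)
    n_unique_rows = np.unique(coords, axis=0).shape[0]
    return {
        "nan_values": int(np.isnan(coords).sum()),
        "inf_values": int(np.isinf(coords).sum()),  # extra check, an assumption
        "duplicate_rows": int(coords.shape[0] - n_unique_rows),
        "all_zero_rows": int((~coords.any(axis=1)).sum()),
    }

# Tiny hand-made example: one duplicated row and one all-zero row.
example = np.array([[1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])
report = degeneracy_report(example)
print(report)
```

A report of all zeros rules out these particular causes but says nothing about subtler ones, such as many near-duplicate points.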
I’ll have a look at the output of the PCA and TSVD reductions and see if there are any odd results as you suggest. Will report back.
So I ran
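import numpy as np
import pandas as pd
pd.DataFrame(tsvd_coords).duplicated().any()  # any duplicated rows?
np.isnan(tsvd_coords).any()  # any NaN values? (pd.np is no longer available in recent pandas)
np.where(~tsvd_coords.any(axis=1))  # indices of all-zero rows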
For the TSVD, PCA and t-SNE produced arrays, and all came up False. I'd also note that for the t-SNE array, the error was triggered when using some combinations of min_cluster_size and min_samples. Unfortunately my trials didn't record what the parameter values were when the errors occurred, but it indicates that the error can occur with the t-SNE array for certain parameter values, whilst with TSVD and PCA every variation of the parameters gave the error.
If it helps I can provide the t-SNE/PCA/TSVD dimension data.
*Edit: I was wrong, I did record the parameters.
min_cluster_size-min_samples
3000-14
4000-14
5000-14
5000-16
5000-18
The trials for t-SNE covered min_cluster_size from 100 to 1000 in increments of 100, and from 1000 to 5000 in increments of 1000. For each min_cluster_size, a min_samples value between 2 and 20 in increments of 2 was tested. The specific "zero-size array to reduction operation minimum which has no identity" error occurred only in the combinations above.
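A sweep like the one described can be set up so that failing parameter pairs are always recorded. The sketch below is library-agnostic: `fit` is any callable taking `(data, min_cluster_size, min_samples)`, and the `fake_fit` stand-in that fails when `min_cluster_size` exceeds the dataset size is purely hypothetical, there only to exercise the harness.

```python
from itertools import product

def build_grid():
    """The parameter grid described above: 14 sizes x 10 min_samples values."""
    sizes = list(range(100, 1000, 100)) + list(range(1000, 5001, 1000))
    samples = list(range(2, 21, 2))
    return list(product(sizes, samples))

def sweep(fit, data, grid):
    """Call fit(data, min_cluster_size, min_samples) for every pair and
    record the pairs that raise ValueError, along with the message."""
    failures = []
    for min_cluster_size, min_samples in grid:
        try:
            fit(data, min_cluster_size, min_samples)
        except ValueError as exc:
            failures.append((min_cluster_size, min_samples, str(exc)))
    return failures

def fake_fit(data, min_cluster_size, min_samples):
    # Hypothetical stand-in for a real clusterer, not hdbscan's behaviour.
    if min_cluster_size > len(data):
        raise ValueError("min_cluster_size larger than dataset")

grid = build_grid()
failures = sweep(fake_fit, list(range(2500)), grid)
print(len(grid), "combinations,", len(failures), "failures")
```

With the real library, `fit` could be something like `lambda X, mcs, ms: hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit(X)`.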
For any dataset there will exist values of min_cluster_size and min_samples that will cause errors such as this, although they should usually be rather strange values (excessively large compared to the total dataset size). As to what could be causing the issue here, it is not clear to me. You have eliminated the most obvious possibilities, so clearly there is something more subtle going on. If you can provide the reduced dimension data for me to test with, I can try to track down what is actually causing the issue.
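For context on the message itself: it is NumPy's generic error for taking a minimum over an empty array, as the snippet below reproduces. Which intermediate array inside hdbscan ends up empty for these inputs is not established in this thread.

```python
import numpy as np

empty = np.array([])
try:
    empty.min()  # a minimum over zero elements has no defined value
except ValueError as exc:
    message = str(exc)
    print(message)
```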
I've saved the arrays using np.save and zipped all three into an archive. It's a little large for here (21MB), so you can download it from Dropbox.
https://www.dropbox.com/s/bi317up7gu9ti4v/output_archive.zip?dl=0
Thanks for the continued investigation!
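As an aside for anyone sharing arrays the same way, `np.savez_compressed` bundles several arrays into one compressed file, which is an alternative to `np.save` plus a separate zip step. The array shapes and the file name below are made up for the example.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
# Small stand-ins for the three reduced arrays; the shapes are assumptions.
tsne_coords = rng.normal(size=(10, 3))
pca_coords = rng.normal(size=(10, 50))
tsvd_coords = rng.normal(size=(10, 50))

path = os.path.join(tempfile.mkdtemp(), "reduced_coords.npz")  # hypothetical name
np.savez_compressed(path, tsne=tsne_coords, pca=pca_coords, tsvd=tsvd_coords)

# Loading returns a dict-like archive keyed by the names used when saving.
with np.load(path) as archive:
    restored = {name: archive[name] for name in archive.files}
```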
Hello, I have been using HDBSCAN for exploratory data analysis. The data usually has between 10 and 20 dimensions. I tried clustering with and without dimensionality reduction, since creating plots of the data is less important than interpreting the accuracy and hierarchy of the clusters. Why would the clustering results for the raw data differ from those for the t-SNE projection? I am not sure which clustering result I should interpret and use moving forward. I have also tried LargeVis and got different clustering results again.
Raw Data1:
TSNE Data1:
Raw Data2:
TSNE Data2:
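One way to put a number on how much two such clusterings disagree is the adjusted Rand index (ARI). The NumPy sketch below computes it over the points that both runs assigned to a cluster, following HDBSCAN's convention of labelling noise as -1. Dropping noise points is a choice made for this example, and the label arrays are toy values. scikit-learn's `adjusted_rand_score` provides the same index without the noise filtering.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b, noise_label=-1):
    """ARI between two labelings, ignoring points that are noise in either."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    keep = (a != noise_label) & (b != noise_label)
    a, b = a[keep], b[keep]
    if a.size < 2:
        raise ValueError("need at least two points clustered in both labelings")
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Contingency table: how many points fall in cluster i of A and j of B.
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (a_idx, b_idx), 1)

    def pairs(n):
        return n * (n - 1) / 2.0  # number of unordered pairs among n items

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(a.size)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (sum_cells - expected) / (max_index - expected)

# Toy labels: same partition under different cluster ids, one noise point.
raw_labels = [0, 0, 1, 1, -1, 2, 2]
tsne_labels = [1, 1, 0, 0, 0, 2, 2]
agreement = adjusted_rand_index(raw_labels, tsne_labels)
print(agreement)
```

A value near 1 means the two runs group the shared points the same way; values near 0 mean agreement no better than chance. It does not say which of the two clusterings is the better one.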