scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.81k stars 507 forks

HDBSCAN with and without TSNE(or any dimensionality reductions) #215

Open djt61 opened 6 years ago

djt61 commented 6 years ago

Hello, I have been using HDBSCAN for exploratory data analysis. The data usually has between 10 and 20 dimensions. I ran the clustering analysis both with and without dimensionality reduction, since creating plots of the data is not as important as interpreting the accuracy and hierarchy of the clusters. What would be the reason for getting different clustering results on the raw data compared to the data after a t-SNE projection? I am not sure which clustering result I should interpret and use moving forward. I have also tried LargeVis and got different clustering results.

Raw Data1: panc05_02_condensedtree

TSNE Data1: panc05_02_condensedtree_tsne

Raw Data2: panc05_03_condensedtree

TSNE Data2: panc05_03_condensedtree_tsne

lmcinnes commented 6 years ago

The short answer is that t-SNE preserves local structure, but doesn't necessarily preserve global structure or density structure. Specifically, it has a tendency to draw points into well-defined clumps even if they are outlying (this is the nature of the variable-width Gaussians it uses based on perplexity: sparse points get wider Gaussians, so they connect with the closest denser area and get pulled in). You can see that result in the plots: the t-SNE versions have far less noise (clusters tend to lose few points before splitting apart). This need not be a bad thing, but it does mean you need to verify your clusters carefully to be sure the results are not artifacts of t-SNE. This Stack Overflow post has some useful discussion of potential issues with using t-SNE as a preprocessing step for clustering. It is not that it can't work; rather, with a bad perplexity value it can easily introduce artifacts, so you do need some way to validate the results (by hand if necessary).

mtngld commented 6 years ago

This is a great read on this topic:

https://distill.pub/2016/misread-tsne/

djt61 commented 6 years ago

Thank you for the answers and the links to the discussion posts. The data we cluster is highly exploratory and derived from biological experiments, with the possibility of contradicting the existing biology literature. There is no proper way to validate it, and we rely heavily on the accuracy of the clustering results. Thus it would be safe to say that using t-SNE as a preprocessing step for clustering is not advisable in this application.

Would you be able to suggest any algorithms for visualizing clustering results? I did read your paper on UMAP. I can do similar comparisons with UMAP or any other algorithm and get back to you.

Minyall commented 6 years ago

I’ve been running some trials on a set of Doc2Vec-generated vectors (300 dimensions) and have noted that if I use t-SNE to reduce this to 3 dimensions the clustering works OK, but about 50-60% of the points are classified as noise. Given the concerns above about t-SNE, I tried using PCA and TSVD from sklearn, reducing only to 50 dimensions, but found that these techniques would always give me the error here:

https://github.com/scikit-learn-contrib/hdbscan/issues/151

.... whatever the minimum cluster size or minimum sample size.

Does hdbscan only work with tsne?

lmcinnes commented 6 years ago

It most certainly doesn't only work with t-SNE. What could be causing these issues is far less clear to me. I would expect PCA and TSVD to be essentially equivalent and quite safe for these purposes. It may be that something weird is happening in the PCA/TSVD that results in oddness in the resulting reduced dimension data (NaNs? multiple duplicated points? All zero vectors?).

Minyall commented 6 years ago

I’ll have a look at the output of the pca and tsvd reductions and see if there are any odd results as you suggest. Will report back.

Minyall commented 6 years ago

So I ran

pd.DataFrame(tsvd_coords).duplicated().any()
pd.np.isnan(tsvd_coords).any()
pd.np.where(~tsvd_coords.any(axis=1))

I ran these on the TSVD, PCA and t-SNE arrays, and all came up False. I'd also note that for the t-SNE array, the error was triggered when using some combinations of min_cluster_size and min_samples. Unfortunately my trials didn't record what the parameter values were when the errors occurred, but it indicates that it can occur with the t-SNE array when certain parameter values are given, whilst with TSVD and PCA every variation of the parameters gave the error.
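For anyone repeating these checks today: the `pd.np` alias has since been removed from pandas, so the same three tests (duplicated rows, NaNs, all-zero rows) are easier to run with NumPy directly. A minimal sketch follows; `sanity_check` and the tiny `demo` array are hypothetical names and data used only for illustration, and the infinity check is an extra not in the original commands.

```python
import numpy as np


def sanity_check(coords):
    """Summarise common problems in a (n_samples, n_dims) array of reduced coordinates."""
    coords = np.asarray(coords, dtype=float)
    n_unique = np.unique(coords, axis=0).shape[0]
    return {
        "has_nan": bool(np.isnan(coords).any()),
        "has_inf": bool(np.isinf(coords).any()),  # extra check, not in the original commands
        "n_duplicate_rows": int(coords.shape[0] - n_unique),
        "n_all_zero_rows": int((~coords.any(axis=1)).sum()),
    }


# Hypothetical toy input: one duplicated row, one all-zero row, one row with a NaN.
demo = np.array([[1.0, 2.0],
                 [1.0, 2.0],
                 [0.0, 0.0],
                 [3.0, np.nan]])
report = sanity_check(demo)
print(report)
```

Returning counts rather than a single boolean makes it easier to tell a handful of duplicated points from a degenerate array.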

If it helps I can provide the t-SNE/PCA/TSVD dimension data.

*Edit: I was wrong, I did record the parameters.

min_cluster_size-min_samples
3000-14
4000-14
5000-14
5000-16
5000-18

The trials for t-SNE covered min_cluster_size from 100 to 1000 in increments of 100, and from 1000 to 5000 in increments of 1000. For each min_cluster_size, min_samples values between 2 and 20 in increments of 2 were tested. The specific error, "zero-size array to reduction operation minimum which has no identity", occurred only for the combinations above.
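A sweep like the one described can be written so that failing parameter pairs are always recorded alongside the error message. This is only a sketch of that bookkeeping: `sweep` and `fake_fit` are hypothetical names, and `fake_fit` is a stand-in that fails for an arbitrary region of the grid purely to exercise the logging. It does not reproduce the failures listed above.

```python
from itertools import product

# Grid as described: 100..900 step 100 plus 1000..5000 step 1000, min_samples 2..20 step 2.
min_cluster_sizes = list(range(100, 1000, 100)) + list(range(1000, 5001, 1000))
min_samples_values = list(range(2, 21, 2))
grid = list(product(min_cluster_sizes, min_samples_values))


def sweep(fit, grid):
    """Call fit(min_cluster_size, min_samples) for every pair and collect ValueErrors."""
    failures = []
    for min_cluster_size, min_samples in grid:
        try:
            fit(min_cluster_size, min_samples)
        except ValueError as err:
            failures.append((min_cluster_size, min_samples, str(err)))
    return failures


def fake_fit(min_cluster_size, min_samples):
    """Hypothetical stand-in for a clustering call; the failure rule is arbitrary."""
    if min_cluster_size > 4000 and min_samples > 16:
        raise ValueError(
            "zero-size array to reduction operation minimum which has no identity"
        )


failures = sweep(fake_fit, grid)
for min_cluster_size, min_samples, message in failures:
    print(f"{min_cluster_size}-{min_samples}: {message}")
```

With real data, `fit` could be something like `lambda mcs, ms: hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit(X)`, where `X` is the reduced array being tested.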

lmcinnes commented 6 years ago

For any dataset there will exist values of min_cluster_size and min_samples that will cause errors such as this, although they should usually be rather strange values (excessively large compared to the total dataset size). As to what could be causing the issue here -- it is not clear to me. You have eliminated the most obvious possibilities, so clearly there is something more subtle going on. If you can provide the reduced dimension data for me to test with I can try to track down what is actually causing the issue.

Minyall commented 6 years ago

I've saved the arrays using np.save and zipped all three into an archive. It's a little large for here (21MB) so you can download from Dropbox.

https://www.dropbox.com/s/bi317up7gu9ti4v/output_archive.zip?dl=0

Thanks for the continued investigation!