lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

hdbscan on UMAP subspace #57

Open ahy1221 opened 6 years ago

ahy1221 commented 6 years ago

The docs say, "With a little care (documentation on how to be careful is coming) it partners well with the hdbscan clustering library." Are there any updates on the "little care", or any quick pointers on how to use hdbscan to cluster on the UMAP subspace? Thanks in advance!

lmcinnes commented 6 years ago

Docs are still a work in progress; I've been diverted by a number of other things. You can look at issue #25, which has some discussion of what works and some of the potential pitfalls. Right now that's the best documentation there is.

birdsarah commented 5 years ago

Another follow-up question.

I had problems using hdbscan on my UMAP embedding, but I may be interpreting things wrong.

I colored all my clusters with a random color, or red if hdbscan returned -1 (aka uncategorized). Here's an example output: [image: index]

Once colored I was surprised to see:

But, my understanding of UMAP is that the 2d representation that we see (if we asked for 2d), is the representation, and that clusters we see with our eyes are the clusters.

Whether that's true is the big "if", and I would really appreciate your input on it.

If that's true, then I found that for my data, I needed to not use hdbscan to get clusters labeled in the way that I visually expect. I'm more than happy to share what I did, but would appreciate clarification on whether I'm understanding correctly.
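For readers who want to reproduce the coloring described above (a random color per cluster, red for the -1 label), here is a minimal sketch. The function name `label_colors`, the seed, and the hex-string output are illustrative assumptions, not the code actually used in this thread.

```python
import random

NOISE_COLOR = "#ff0000"  # red marks points labeled -1 (no cluster)


def label_colors(labels, seed=0):
    """Map each cluster label to a random hex color; -1 is always red.

    `label_colors` is a hypothetical helper written for illustration.
    """
    rng = random.Random(seed)
    palette = {}
    for label in sorted(set(int(l) for l in labels)):
        if label == -1:
            palette[label] = NOISE_COLOR
        else:
            # Cap the red channel so no cluster color equals the noise color.
            palette[label] = "#{:02x}{:02x}{:02x}".format(
                rng.randrange(128), rng.randrange(256), rng.randrange(256)
            )
    # One color per point, in the same order as the input labels.
    return [palette[int(l)] for l in labels]
```

The returned list can be passed as the per-point color argument of most scatter-plot functions, so points that hdbscan left unclustered stand out in red.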

lmcinnes commented 5 years ago

That's an intriguing result! I admit I don't entirely understand what has happened here. Are you clustering in high dimensional space and then coloring the embedding accordingly? If that's the case then something is astray indeed -- there is quite a cloud of points that were assigned to clusters but have been cast out by UMAP. I would love to have more details about what you did, and the data involved, because I feel like there is a failure case in the making here that I might need to investigate.

As to how to interpret things... UMAP isn't really a clustering algorithm, it's more for dimension reduction, and it handles noise differently than something like HDBSCAN. In particular it will tend to pull noise in toward whatever is nearest. That means that if your data is not very noisy it can be effective, but for very noisy data it may not do entirely as one might expect. In contrast HDBSCAN is very conservative about noise, which can be problematic for high dimensional data where density is scarce and everything starts to look like noise.

Looking at your pictures here my feeling is that you have some number of clusters which UMAP clumped into tight little blobs, and a fair amount of noise which UMAP didn't really know what to do with and is generally just getting pushed away from (and squished between) the blobs.

I feel like I am not really answering your question well, but I feel like I am not understanding what is happening here either. If you can be patient with me and let me know if I'm on the right track I would appreciate it.
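For context, the workflow under discussion (reduce with UMAP, then run HDBSCAN on the embedding rather than on the original high-dimensional data) might look roughly like the sketch below. The function names and every parameter value are illustrative assumptions, not settings recommended in this thread; the `umap-learn` and `hdbscan` packages must be installed for `cluster_on_embedding` to run.

```python
def cluster_on_embedding(data, n_components=2, min_cluster_size=15):
    """Embed `data` with UMAP, then cluster the embedding with HDBSCAN.

    Hypothetical helper; all parameter values are assumptions to tune.
    """
    # Imported lazily so noise_fraction() below works without these packages.
    import umap      # provided by the umap-learn package
    import hdbscan

    embedding = umap.UMAP(
        n_neighbors=30,        # assumption: larger neighborhoods, broader structure
        min_dist=0.0,          # assumption: allow tight packing of points
        n_components=n_components,
        random_state=42,       # fixed seed for repeatable embeddings
    ).fit_transform(data)

    # fit_predict returns one integer label per point; -1 means noise.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embedding)
    return embedding, labels


def noise_fraction(labels):
    """Share of points left unclustered (label -1)."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)
```

Checking `noise_fraction(labels)` after a run gives a quick sense of how conservative HDBSCAN was being about noise on a particular embedding.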

birdsarah commented 5 years ago

Thanks @lmcinnes for pointing me in the right direction offline. A datashaded view of my data shows the areas of density:

[image: dbscan_sample_0_embedding_15_script_netloc_func_name_counts]

Here are the results colored by cluster label from HDBSCAN with default params (red is no cluster):

[image: hdbscan_sample_0_embedding_15_script_netloc_func_name_cats]

and here's scikit-learn's DBSCAN (default params):

[image: dbscan_sample_0_embedding_15_script_netloc_func_name_cats]

In playing with this a lot more, I realized that HDBSCAN, for my data, is very sensitive to the parameters you pass it. I need to figure out the right params for my use case and decide whether excluding data, in the way that hdbscan is inclined to do by default, is helpful for my needs.
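One way to explore that sensitivity is to sweep a single parameter and record how many clusters and how many unclustered points each setting produces. The sketch below is an assumption-laden illustration: `summarize` and `sweep_min_cluster_size` are hypothetical helpers, the candidate sizes are arbitrary, and the `hdbscan` package must be installed for the sweep itself to run.

```python
from collections import Counter


def summarize(labels):
    """Return (number of clusters, number of noise points) for a labeling."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # -1 is the "no cluster" label
    return len(counts), n_noise


def sweep_min_cluster_size(embedding, sizes=(5, 15, 50, 150)):
    """Cluster `embedding` once per candidate size and summarize each run.

    The sizes are arbitrary starting points, not recommended values.
    """
    import hdbscan  # imported lazily; summarize() works without it

    results = {}
    for size in sizes:
        labels = hdbscan.HDBSCAN(min_cluster_size=size).fit_predict(embedding)
        results[size] = summarize(labels)
    return results
```

Comparing the noise counts across sizes shows how much data each setting excludes, which is the trade-off described above.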