scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Too much noise found #72

Open Phlya opened 8 years ago

Phlya commented 8 years ago

Hi,

I am having a problem with clustering a data set. The data has been extensively filtered before clustering to remove uninteresting samples, so I would expect the vast majority of samples to cluster with some other samples. If I try other clustering algorithms, which simply partition the dataset into a reasonable number of clusters, I see that all clusters make sense and there is (almost) no noise. However, with every min_cluster_size and min_samples setting I have tried, hdbscan labels a large fraction of the samples (~1/4-1/3) as noise. I can clearly see by eye that there is structure in that noise too... Is there anything else to do about it?

I'm attaching a seaborn clustermap to show that there is no real noise in the data, along with what I get from HDBSCAN (the leftmost cluster being what is detected as noise): [image: clustermap] https://cloud.githubusercontent.com/assets/2895034/20030282/ed2e5d24-a359-11e6-8cd9-2dbff65d1cab.png [image: hdbscan_clusters] https://cloud.githubusercontent.com/assets/2895034/20030299/1f251336-a35a-11e6-8d10-8dfe2caa615a.png

Agglomerative clustering produces something closer to what I expect: [image: agglomerativeclustring_fromcormatrix_11_clusters] https://cloud.githubusercontent.com/assets/2895034/20030278/e92e1386-a359-11e6-8153-4411f4c5f89c.png

Is it possible to force all points to the nearest cluster, for example?

lmcinnes commented 8 years ago

I am currently working on code that can provide a membership vector, giving the probability that a given point is in each of the found clusters. This currently lives in the prediction branch and is not complete yet. It might satisfy your desire to assign everything.

The other alternative is to simply access the single_linkage_tree_ attribute, which gives you an uncondensed tree akin to robust single linkage. That should give you something more comparable to the hierarchical clusterings.
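A minimal sketch of that second route, assuming a fitted clusterer (the data, min_cluster_size, and cut_distance values here are arbitrary placeholders to tune for your dataset):

```python
import numpy as np
import hdbscan

data = np.random.rand(200, 10)  # placeholder for the real dataset

clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)

# single_linkage_tree_ holds the robust single linkage hierarchy;
# cutting it at a fixed distance gives a flat clustering with far
# less aggressive noise assignment than the condensed-tree labels.
labels = clusterer.single_linkage_tree_.get_clusters(
    cut_distance=0.5,      # arbitrary placeholder; tune for the data
    min_cluster_size=15,
)
```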

I'm travelling at the moment, so I can't get into too many details. Hopefully some of my colleagues can tackle this in a little more detail.


lmcinnes commented 8 years ago

I should point out that I agree the results you are seeing are less than ideal, but I would need to know a little more about the data to start to understand why that might be the case. It would be nice to get better results here.

Phlya commented 7 years ago

Thanks for the answer! Concerning the data: it is a correlation matrix with ~5000 rows and columns. What else would you like to know about it? I think I could probably share it... (I tried clustering the raw data rather than the correlation matrix, and the results were only worse, judging by silhouette score.) Would using single_linkage_tree_ from hdbscan have any advantage over hierarchical clustering?
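As an aside, one way such a correlation matrix might be handed to hdbscan directly is as a precomputed distance matrix; a sketch, where the 1 - corr conversion and min_cluster_size are illustrative choices, not something from this thread:

```python
import numpy as np
import hdbscan

raw = np.random.rand(5000, 10)  # placeholder for the raw ~5000 x 10 matrix

# Correlation of all rows vs. all rows, as described above
corr = np.corrcoef(raw)

# One common way to turn correlation into distance:
# identical rows get distance 0, anti-correlated rows get 2
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=15)
labels = clusterer.fit_predict(dist)
```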

lmcinnes commented 7 years ago

I would be wary of banking too much on the silhouette score -- it can get a little strange if you have a lot of noise. The "single linkage tree" is actually robust single linkage, so yes, it has some advantages over standard single linkage hierarchical clustering, most particularly that it is resistant to noise. It may well be the case, however, that your particular dataset works best with standard hierarchical clustering. In that case please do check into shareability, because I'm always keen to see/have examples of hdbscan not working well so I can try to improve it.


Phlya commented 7 years ago

Is there anything better than the silhouette score? It doesn't really look like there is much noise here anyway, I think? OK, I see, thanks! I'll look into sharing the data; I'll have to check with someone else about this.

Phlya commented 7 years ago

OK, I can share the data with you. This is the raw matrix, ~5000 by 10; the figures I showed were correlation matrices of all rows vs. all rows (which is quite big, so it's easier to recreate on your side). Let me know if you can get it to work better with this data! data.tsv.zip https://github.com/scikit-learn-contrib/hdbscan/files/578440/data.tsv.zip

lmcinnes commented 7 years ago

Thanks. I'll take a look at it when I get a chance. Unfortunately I'm travelling at the moment, so I don't have much time. I'll let you know if I can do anything with the data.


michaelaye commented 7 years ago

I'm wondering how this membership probability vector differs from clusterer.probabilities_?

lmcinnes commented 7 years ago

The goal of the membership probability vector (which is getting closer to landing in master at last) is to provide, for each point, a vector of probabilities of it being a member of each cluster. This includes noise points, which are not assigned to clusters but for which you may still want to know what their "most likely" cluster would be, etc.
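A sketch of the distinction, using the soft clustering API that later landed in master (data and parameters are placeholders):

```python
import numpy as np
import hdbscan

data = np.random.rand(500, 10)  # placeholder data

# prediction_data=True is required for the soft clustering routines
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(data)

# probabilities_: one number per point -- the strength of membership in
# the single cluster the point was assigned to (0.0 for noise points)
print(clusterer.probabilities_.shape)   # (n_samples,)

# Membership vectors: one row per point, one column per cluster,
# defined even for points that were labelled as noise
membership = hdbscan.all_points_membership_vectors(clusterer)
print(membership.shape)                 # (n_samples, n_clusters)
```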


kgullikson88 commented 7 years ago

I have a similar issue with my dataset, in that the most common "cluster" is often -1 (noise). You mentioned above:

I am currently working on code that can provide a membership vector, giving the probability that a given point is in each of the found clusters. This currently lives in the prediction branch and is not complete yet. It might satisfy your desire to assign everything.

I wonder if you have any ETA for when that branch will be considered stable and merged into master, or if there is some way to use the single linkage or condensed tree to estimate the "closest" cluster for the noise points (even if it doesn't come with a probability).

lmcinnes commented 7 years ago

It's coming fairly soon. I can't make any promises at this time, but I would really like to have it arrive in February or March. I understand this is fairly high priority as there are a number of requests for this, so I'll try to get it done ASAP.


lmcinnes commented 7 years ago

I've merged in the prediction branch, which includes soft clustering. It is still "experimental" but I believe it should work. I'd be keen to have some people try it out, so if you are interested please take a moment to clone from master and experiment. The relevant new routines are

approximate_predict
membership_vector
all_points_membership_vectors

The associated docstrings should give you an idea of how to use them while I get some proper documentation/tutorial material written.
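A quick sketch of how the three routines are called (prediction_data=True at fit time is required; data and parameters are placeholders):

```python
import numpy as np
import hdbscan

train = np.random.rand(500, 10)  # placeholder training data
new = np.random.rand(20, 10)     # placeholder unseen points

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(train)

# Hard labels and membership strengths for previously unseen points
labels, strengths = hdbscan.approximate_predict(clusterer, new)

# Per-cluster membership probabilities for the unseen points
new_membership = hdbscan.membership_vector(clusterer, new)

# Per-cluster membership probabilities for every training point
all_membership = hdbscan.all_points_membership_vectors(clusterer)
```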

thoth291 commented 5 years ago

I wonder what happened to those routines. Has anyone tested them? They are not pushed to master, as far as I can tell...

lmcinnes commented 5 years ago

They are essentially just the soft clustering routines. Unfortunately there seem to be some odd bugs that I have never had the time to track down. They may well work for your case.

thoth291 commented 5 years ago

So if I understood correctly, then this: membership_vector *= prob_in_some_cluster(x, tree, cluster_ids, point_dict, max_lambda_dict) should give me a membership for each cluster. Does it include missing (noise, a.k.a. cluster = -1) points? It looks like I need to redo the entire notebook now to see whether I get what I need in the first place... Since I'll be looking into it (no promises), I can check whether I can catch the bug you mentioned. Any reproducible example?
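For later readers: rather than combining the internal helpers by hand, the merged public routine all_points_membership_vectors returns a membership row for every point, noise included (per the explanation above), so forcing every point onto its most likely cluster, the question that opened this thread, can be sketched as follows (data and parameters are placeholders):

```python
import numpy as np
import hdbscan

data = np.random.rand(500, 10)  # placeholder data

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(data)
membership = hdbscan.all_points_membership_vectors(clusterer)

# Noise points (label -1) still get a full membership row, so argmax
# assigns every point, noise included, to its most likely cluster
forced_labels = np.argmax(membership, axis=1)
```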