scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Can't seem to get cosine metric working. #69

Open rn123 opened 8 years ago

rn123 commented 8 years ago

hdbscan==0.8.2

Can't seem to get cosine metric working:

hdbscan.hdbscan_.DistanceMetric.get_metric('cosine')
<hdbscan.dist_metrics.ArccosDistance at 0x7f83b22a8c08>

So the metric is there, but:

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric='cosine')
labels = clusterer.fit_predict(np.array(data_d2v))

yields:

ValueError: Unrecognized metric 'cosine'

lmcinnes commented 8 years ago

That is definitely a problem. I'll look into it as soon as I get a chance. In the meantime you can try 'arccos' as a metric, which may work better.

rn123 commented 8 years ago

Same issue with 'arccos' -- I thought it was linked to 'cosine'.

lmcinnes commented 8 years ago

It seems that the issue is that we are instantiating ball/kd-trees from sklearn, which does not support such distance metrics (yet). If you have a small enough dataset then adding the keyword argument algorithm='generic' will get you the results you desire. In the meantime I'll see if I can push the relevant metrics upstream to sklearn.
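The restriction can be checked directly against sklearn's tree classes: 'cosine' is absent from the metrics either tree supports, which is why the tree-based code paths reject it. A quick check, assuming a scikit-learn version that exposes `valid_metrics` as a class attribute:

```python
from sklearn.neighbors import BallTree, KDTree

# HDBSCAN's fast code paths build these sklearn trees, which only
# accept metrics satisfying the triangle inequality; 'cosine' is not
# among them, hence the "Unrecognized metric" error.
print('cosine' in BallTree.valid_metrics)  # False
print('cosine' in KDTree.valid_metrics)    # False
```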

oxymor0n commented 7 years ago

So I'm also trying to use cosine similarity as my distance metric. Here's what I've tried so far:

- Straight up using metric='cosine' doesn't work at all because it's not implemented in sklearn.neighbors.DistanceMetric http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

- Using algorithm='generic', metric='cosine' makes the .fit method throw ValueError: Buffer dtype mismatch, expected 'double_t' but got 'float'

- Precomputing the cosine similarity and then using metric='precomputed' throws the same error

lmcinnes commented 7 years ago

There remain some issues with this -- it's still being worked on. Apologies. In the meantime it is worth noting that if you normalize your vectors then euclidean distance is equal to arccos distance and all is well, so that's the right way to proceed for now.
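The claim is easy to check numerically: on l2-normalized vectors, squared euclidean distance equals 2·(1 − cosine similarity), so euclidean and cosine (and hence arccos) distances order points identically. A plain numpy sketch:

```python
import numpy as np

rng = np.random.RandomState(0)
u, v = rng.rand(5), rng.rand(5)

# l2-normalize both vectors
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos_sim = np.dot(u, v)
euclidean_sq = np.sum((u - v) ** 2)

# On the unit sphere: ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 - 2 cos_sim
assert np.isclose(euclidean_sq, 2 - 2 * cos_sim)
```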


oxymor0n commented 7 years ago

Thanks for the hint, I totally missed that part! :D

Is there anything I could do to help?

lmcinnes commented 7 years ago

If you can figure out the "right" way to make it work I would appreciate it. There are a few approaches, but they each have some drawbacks. I would be happy to know if there is a "better" approach. Pull requests are even better.


oxymor0n commented 7 years ago

What ideas have you explored/worked on?

lmcinnes commented 7 years ago

There are a few options:

- One can make custom kd-trees and ball-trees that support angular distance. I have done this, but not folded it in -- I don't want to support what are essentially duplicates of code from sklearn.
- One can make a normalized copy of the data and then run everything with euclidean distance. This is easy, but some other results will potentially look weird or fail to match the input data (cluster exemplars, for instance), and prediction will break.
- One can require distance matrices, and then fix up the bug to ensure everything gets recast as doubles before going into Cython.


oxymor0n commented 7 years ago

Personally I feel that merging your custom trees into sklearn proper will be the best approach, but I don't know enough about the specific implementation differences that could block the merge to comment further on that.

For the normalized-copy approach, if the altered vector values are a problem, maybe we could keep a scaling vector so we can recover the unscaled input data for operations that require it?

As for the Cython bug, I double-checked my code today and it seems casting the input matrix to np.double actually works, so that could be another way to do it. The downside is that the cast causes memory usage to go through the roof (it maxed out my 16GB when I clustered a (25000, 300) input with algorithm='generic', metric='cosine').

lmcinnes commented 7 years ago

I understand, but that is simply one of the issues with computing an all-pairs distance matrix. I'll see if I can work out something else reasonable.


oxymor0n commented 7 years ago

@lmcinnes sorry which "issue" out of the three do you mean?

lmcinnes commented 7 years ago

The casting to doubles -- in general pairwise distance matrices are expensive and not a good way to go if you can help it. I should make that easier however.
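The expense is easy to quantify: an all-pairs float64 matrix over n samples takes 8·n² bytes, so the 25000-sample case mentioned earlier in the thread is already close to 5 GB before any clustering starts. A quick back-of-the-envelope check:

```python
n = 25_000
bytes_needed = 8 * n * n        # one float64 per pair of samples
print(bytes_needed / 2**30)     # roughly 4.66 GiB for the matrix alone
```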


zyrikby commented 7 years ago

This seems to work for me:

from sklearn.metrics import pairwise_distances
from hdbscan import HDBSCAN

pw_distance = pairwise_distances(df[col_names], metric="cosine", n_jobs=-1)
clusterer = HDBSCAN(min_cluster_size=min_cluster_size, metric='precomputed')
clusterer.fit(pw_distance)

KCraw commented 7 years ago

The above didn't work for me, but this did...

from sklearn.metrics import pairwise_distances
from hdbscan import HDBSCAN

distance = pairwise_distances(X, metric='cosine')
clusterer = HDBSCAN(min_cluster_size=min_cluster_size, metric='precomputed')
# casting to float64 avoids the "Buffer dtype mismatch" error
db = clusterer.fit(distance.astype('float64'))

SundareshPrasanna commented 7 years ago

Hi @lmcinnes .. I still find this issue occurring .. I tried computing pairwise_distance over Cosine distance and clustering based on that but it is highly memory inefficient. Do you have any workaround to this ? thanks :)

lmcinnes commented 7 years ago

Sadly at this time, no. Current efficient methods require the use of space trees which, in turn, require metrics that admit the triangle inequality. Unfortunately this is not true of cosine distance, and hence things break down quickly. I am working on other approaches that would allow for more generic measures, but I honestly can't say when they will be available. In the meantime there is a semi-workaround available.
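That cosine distance violates the triangle inequality can be seen with three vectors in the plane (the specific vectors below are just an illustration):

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([0.0, 1.0])

# d(u, w) = 1, but d(u, v) + d(v, w) is about 0.586:
# the "detour" through v is shorter than the direct distance,
# so the triangle inequality fails.
print(cosine_distance(u, w))
print(cosine_distance(u, v) + cosine_distance(v, w))
```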

The next best thing to do is to use angular distance instead of cosine distance -- they are similar, but angular distance admits the triangle inequality. Now angular distance is not supported out of the box either (using the space trees in sklearn), but euclidean distance on l2 normalized vectors is equivalent to angular distance, so you can normalize the vectors and then use euclidean distance: thus something like

from sklearn.preprocessing import normalize
from hdbscan import HDBSCAN

norm_data = normalize(data, norm='l2')
clusterer = HDBSCAN()  # euclidean distance is the default
clusterer.fit(norm_data)

and that will be an approximation to using cosine distance.

SundareshPrasanna commented 7 years ago

Thank you for that wonderful explanation @lmcinnes. A pretty noob question from my end, but the application I am looking at is finding clusters of similar document vectors (inferred from a bigger pretrained model). Do you think that in such a scenario normalizing the vectors to accommodate the triangle inequality will do the trick?

lmcinnes commented 7 years ago

There are no noob questions. I honestly can't say for sure -- it depends on how the document vectors were generated. If they are the result of PCA (or similar) on a TFIDF matrix then you should be fine. If they are the result of doc2vec or similar then it should work, but you may want to consider doing a non-linear dimension reduction first. If you are up for experimental code then UMAP might be one such option (by default it embeds to 2 dimensions, but supports arbitrary embedding dimension; something like 10 or 20 might be a reasonable thing to try).

To be honest I really don't know for sure -- there's not much you can do but try. On that front, please let me know what, if anything, works!

SundareshPrasanna commented 7 years ago

Thanks a lot @lmcinnes. I'll try my hand at UMAP. I did try it out with the l2-normalized vectors and I am pleasantly surprised: they too produce very similar cluster results to the ones I got using the precomputed cosine metric (plus the advantage of no additional RAM consumption).

But one thing I observed (and it is not related to this issue) is the CPU usage. Computing the clusters takes a long time while CPU usage stays below 20% throughout. Below is my code:

clusterer1 = hdbscan.HDBSCAN(min_cluster_size=40, min_samples=10,
                             cluster_selection_method='leaf',
                             core_dist_n_jobs=-1)
db = clusterer1.fit(norm_data)

Am I doing something wrong?

lmcinnes commented 7 years ago

Nothing jumps out at me. The likely cause is that most of the time is spent shuffling memory to and from the CPU. These days memory bandwidth to the processor is often the actual bottleneck, and without good cache-oblivious algorithms or significant tuning it is hard to fix that.

manu-dikzit commented 6 years ago

@lmcinnes I was wondering whether it makes a difference if the normalization is done along features or along samples?

lmcinnes commented 6 years ago

You would need to normalize across samples to make euclidean approximately the same as cosine -- you are essentially projecting all the samples onto a unit n-sphere, and then (small) euclidean distances are essentially angular distances around the n-sphere.
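In sklearn's `normalize` that corresponds to the default `axis=1`, which scales each row (sample) to unit norm; `axis=0` would instead normalize feature columns, which is not what is wanted here. A quick sketch:

```python
import numpy as np
from sklearn.preprocessing import normalize

# small example matrix: 4 samples, 3 features (values are arbitrary)
X = np.arange(12, dtype=float).reshape(4, 3) + 1.0

X_rows = normalize(X, norm='l2')            # axis=1 by default: per-sample
row_norms = np.linalg.norm(X_rows, axis=1)
print(row_norms)  # every sample now sits on the unit sphere
```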

manu-dikzit commented 6 years ago

That explains it! thanks 👍

YiQ-Zhao commented 5 years ago

One workaround that worked for me is to convert the matrix to a sparse matrix, and then cosine should work.
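A minimal sketch of that workaround, assuming scipy is available; the HDBSCAN call itself is left commented since I can't confirm which hdbscan versions accept sparse input with metric='cosine':

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical data: any 2-D array of feature vectors.
dense = np.random.RandomState(0).rand(20, 5)
sparse = csr_matrix(dense)  # convert to sparse, per the comment above

# Hypothetical usage of the suggestion:
# clusterer = hdbscan.HDBSCAN(metric='cosine')
# clusterer.fit(sparse)
print(sparse.shape)
```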

eduamf commented 2 years ago

sklearn.metrics._dist_metrics.DistanceMetric.get_metric('arccos')

ValueError: Unrecognized metric 'arccos'