scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

exemplars not returned when using precomputed distance matrix #251

Open codata-hg opened 6 years ago

codata-hg commented 6 years ago

When I use a precomputed distance matrix and try to get the exemplars of the clusterer with

clusterer.exemplars_

I get the following error message: 'AttributeError: Currently exemplars require the use of vector input data with a suitable metric. This will likely change in the future, but for now no exemplars can be provided'. I don't understand why exemplars have to be generated in the prediction part, which relies on a KDTree or BallTree, if the clustering is already done. Any ideas? Thanks.

lmcinnes commented 6 years ago

One needs to compute nearest neighbors of the new data points so one can approximate the core distance of the points -- that generally requires trees, or some other nearest neighbor technique. Precomputed distance matrices don't work so well for that. Just computing the exemplars is possible, but all that code got wrapped up together for now. You can effectively reproduce the exemplar computation yourself using the condensed tree representation if you wish. You simply want the set of the most persistent points for each selected cluster.
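For readers who want to try this, here is a minimal sketch of "the most persistent points for each selected cluster". It is not the library's own code. It assumes the condensed tree is available as rows of `(parent, child, lambda_val, child_size)`, which is the layout I believe `clusterer.condensed_tree_.to_numpy().tolist()` produces, and that the caller supplies the ids of the selected clusters. The function name `exemplars_from_condensed_tree` and the toy tree are hypothetical, made up for illustration.

```python
from collections import defaultdict


def exemplars_from_condensed_tree(rows, selected_clusters):
    """Return {cluster_id: sorted exemplar point ids}.

    rows: iterable of (parent, child, lambda_val, child_size) tuples
          (assumed condensed-tree layout; child_size == 1 means a point).
    selected_clusters: ids of the clusters chosen by HDBSCAN (caller-supplied).
    """
    cluster_children = defaultdict(list)  # cluster -> child clusters
    point_children = defaultdict(list)    # cluster -> [(point, lambda_val)]
    for parent, child, lambda_val, child_size in rows:
        if child_size > 1:
            cluster_children[parent].append(child)
        else:
            point_children[parent].append((child, lambda_val))

    def leaves(cluster):
        # Leaf clusters below `cluster` (the cluster itself if it never splits).
        kids = cluster_children.get(cluster, [])
        if not kids:
            return [cluster]
        return [leaf for kid in kids for leaf in leaves(kid)]

    result = {}
    for cluster in selected_clusters:
        exemplar_points = []
        for leaf in leaves(cluster):
            members = point_children.get(leaf, [])
            if not members:
                continue
            # The points that stay in the leaf longest have the largest lambda.
            max_lambda = max(lam for _, lam in members)
            exemplar_points.extend(p for p, lam in members if lam == max_lambda)
        result[cluster] = sorted(exemplar_points)
    return result


# Toy tree: points 0..5, root cluster 6 splits into clusters 7 and 8.
toy_rows = [
    (6, 7, 0.5, 3), (6, 8, 0.5, 3),
    (7, 0, 1.0, 1), (7, 1, 2.0, 1), (7, 2, 2.0, 1),
    (8, 3, 0.8, 1), (8, 4, 1.5, 1), (8, 5, 1.2, 1),
]
toy_exemplars = exemplars_from_condensed_tree(toy_rows, [7, 8])
print(toy_exemplars)  # {7: [1, 2], 8: [4]}
```

The returned values are row indices into the original data (or into the distance matrix), so this works without any vector representation of the points. Ties at the maximum lambda are all kept, which is why cluster 7 has two exemplars in the toy tree.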

codata-hg commented 6 years ago

Thanks for your reply. That makes sense to me. As you suggested, I followed How Soft Clustering for HDBSCAN Works (https://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html) and reproduced the exemplars successfully.

But I still think it would be great to have clusterer.exemplars_ available, maybe by separating exemplar generation from prediction. It shouldn't be hard.

lmcinnes commented 6 years ago

I would be very happy to receive a pull request -- I don't think it is too hard, but I don't have time to work on it right now. If you can make it work that would be great!



codata-hg commented 6 years ago

HDBSCAN is a really awesome clustering technique! I'd love to contribute. I can't guarantee a timeline, but I'll try to make it happen.

lmcinnes commented 6 years ago

Thanks, anything you can manage is greatly appreciated.