scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Using Dynamic Time Warping as distance metric with prediction=True #180

Open EMGrua opened 6 years ago

EMGrua commented 6 years ago

I would like to first thank you all for the very useful library you have created, thank you very much!

On to the question: I am using HDBSCAN to cluster data generated by a simulation that a colleague and I have created. The simulation generates agents, and each agent can perform several activities per day over the course of the simulation. Since there is an important temporal component to the data, we want to use Dynamic Time Warping (DTW) as the distance metric. Furthermore, we need every agent to be placed in a cluster, so I have been using hdbscan.HDBSCAN() with prediction_data=True.

Here is the problem: with prediction_data=True we can't use 'precomputed' as the metric, and passing our DTW function as a 'pyfunc' metric doesn't seem to work either. What do you advise as a solution? I was thinking of going into dist_metrics.pyx and adding the DTW function there; do you think that could work?
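For context, here is a minimal DTW implementation of the kind we are using (the function name is ours, not part of hdbscan); this is what we would like to either pass as a callable metric or use to precompute a pairwise distance matrix:

```python
import numpy as np

def dtw_distance(x, y):
    # Classic O(n*m) dynamic-programming DTW with absolute-difference
    # local cost and steps (match, insertion, deletion).
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# DTW is invariant to local time stretching: a sequence and its
# "stuttered" copy are at distance zero.
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
```

With metric='precomputed' (but without prediction data) one can feed hdbscan the full pairwise matrix built from this function.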

Cheers!

lmcinnes commented 6 years ago

I think, right now, that isn't really going to be possible. The prediction data relies on fast nearest neighbor computation, which (for now) uses sklearn's KDTrees and BallTrees, so you are largely stuck with the metrics they implement (which are many, but don't include DTW). I am working on other projects that will hopefully remedy this (PyNNDescent, which should allow arbitrary metrics for nearest neighbor computation), but none of that is mature enough to rewrite HDBSCAN around yet.

Ultimately, if you need every point to be in a cluster, you really need a partitioning algorithm, and it is quite possible that HDBSCAN is not the right tool for this. Instead you could look at RobustSingleLinkage and find a good cut through the tree, or something like Gaussian Mixture Models. Sorry I can't be more help.
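To illustrate the Gaussian Mixture Model suggestion: unlike HDBSCAN, a GMM partitions the whole dataset, so no point is left as noise. A minimal sketch with scikit-learn (synthetic data, illustrative parameters):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2))])

# fit_predict assigns every point to one of n_components clusters;
# there is no noise label (-1) as there would be with HDBSCAN.
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
assert (labels >= 0).all() and len(labels) == len(X)
```

Note that a plain GMM works on feature vectors with Euclidean-style geometry; combining it with a DTW notion of similarity would need extra work (e.g. embedding the sequences first).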

EMGrua commented 6 years ago

That's a bummer, I really liked using your tool. Anyhow, thank you very much for the quick reply and good luck with your other projects!

m-dz commented 6 years ago

@EMGrua, regarding your second issue, i.e. clustering every single agent: we managed to work around it by calling the all_points_dist_membership_vector() function (imported from hdbscan._prediction_utils) directly with the cluster exemplars. This is not ideal, but the "density" flavour is somewhat preserved, as exemplars tend to follow the cluster shapes quite well. It is also much faster than the "proper" all_points_membership_vectors() function. Worth a try.
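The spirit of that exemplar-based workaround can be sketched without touching hdbscan's private API: assign each point to the cluster whose nearest exemplar is closest under your chosen metric. The function below is a hypothetical illustration (names are ours, and it produces hard labels rather than the membership vectors the hdbscan helper returns):

```python
import numpy as np

def assign_by_exemplars(points, exemplars_per_cluster, metric):
    # Hypothetical sketch: for each point, compute its distance to the
    # nearest exemplar of every cluster, then assign the point to the
    # cluster with the smallest such distance. Because exemplars trace
    # the dense cores of the clusters, this loosely preserves the
    # density-based shapes while guaranteeing every point gets a label.
    labels = []
    for p in points:
        nearest = [min(metric(p, e) for e in exemplars)
                   for exemplars in exemplars_per_cluster]
        labels.append(int(np.argmin(nearest)))
    return np.array(labels)

euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
exemplars = [[np.array([0.0, 0.0]), np.array([0.0, 1.0])],  # cluster 0
             [np.array([5.0, 5.0])]]                        # cluster 1
labels = assign_by_exemplars(np.array([[0.1, 0.2], [4.9, 5.1]]),
                             exemplars, euclidean)
print(labels)  # [0 1]
```

Any metric callable works here, including a DTW function, since no tree structure is involved.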

alexgan04 commented 1 month ago

First, thanks for the best and most accessible hdbscan implementation! Any updates on the issue of working with DTW as the distance metric?