scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.8k stars, 502 forks

Haversine cannot be used with prediction data #88

Open mratsim opened 7 years ago

mratsim commented 7 years ago

When trying to use both the haversine metric and prediction data I get the error: ValueError: metric HaversineDistance is not valid for KDTree

Steps to reproduce (coords is a DataFrame of latitude/longitude pairs):

from hdbscan import HDBSCAN

db = HDBSCAN(min_samples=1,
             metric='haversine',
             core_dist_n_jobs=-1,
             prediction_data=True)

db.fit(coords)
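
A side note on the input, independent of the bug: scikit-learn's haversine metric expects each row as [latitude, longitude] in radians and returns the great-circle distance in radians. The sketch below illustrates that convention with plain NumPy. The helper `haversine_rad` and the constant `EARTH_RADIUS_KM` are illustrative names, not part of hdbscan or scikit-learn.

```python
import numpy as np

# Mean Earth radius in km; an assumed value used only to scale the result.
EARTH_RADIUS_KM = 6371.0


def haversine_rad(a, b):
    """Great-circle distance in radians between two [lat, lon] points in radians.

    Hypothetical helper written for illustration; it follows the standard
    haversine formula.
    """
    dlat = b[0] - a[0]
    dlon = b[1] - a[1]
    h = np.sin(dlat / 2) ** 2 + np.cos(a[0]) * np.cos(b[0]) * np.sin(dlon / 2) ** 2
    return 2 * np.arcsin(np.sqrt(h))


# Two points on the equator, 90 degrees of longitude apart.
coords_deg = np.array([[0.0, 0.0], [0.0, 90.0]])
coords_rad = np.radians(coords_deg)  # convert degrees to radians before clustering

d = haversine_rad(coords_rad[0], coords_rad[1])  # a quarter of a great circle
d_km = d * EARTH_RADIUS_KM
```

If coords holds degrees, something like np.radians(coords) would be the array to pass to fit.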

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-242-133ab39cb2a7> in <module>()
----> 1 db.fit(coords)

/usr/lib/python3.6/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    855 
    856         if self.prediction_data:
--> 857             self.generate_prediction_data()
    858 
    859         return self

/usr/lib/python3.6/site-packages/hdbscan/hdbscan_.py in generate_prediction_data(self)
    889                 self._raw_data, self.condensed_tree_, min_samples,
    890                 tree_type='kdtree', metric=self.metric,
--> 891                 **self._metric_kwargs
    892             )
    893         else:

/usr/lib/python3.6/site-packages/hdbscan/prediction.py in __init__(self, data, condensed_tree, min_samples, tree_type, metric, **kwargs)
     99         self.raw_data = data
    100         self.tree = self._tree_type_map[tree_type](self.raw_data,
--> 101                                                    metric=metric, **kwargs)
    102         self.core_distances = self.tree.query(data, k=min_samples)[0][:, -1]
    103         self.dist_metric = DistanceMetric.get_metric(metric, **kwargs)

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.__init__ (sklearn/neighbors/kd_tree.c:9328)()

ValueError: metric HaversineDistance is not valid for KDTree

I tried changing the algorithm from the default best to prims_balltree and boruvka_balltree, but to no avail.

I found the issue at line 890 of hdbscan_.py, with the tree type hardcoded to kdtree.

/usr/lib/python3.6/site-packages/hdbscan/hdbscan_.py in generate_prediction_data(self)
    889                 self._raw_data, self.condensed_tree_, min_samples,
    890                 tree_type='kdtree', metric=self.metric,

The PredictionData class in prediction.py already supports balltree. After switching the tree type I confirmed that it works, and I can now use the (undocumented) approximate_predict function.

I am not sure of the implications of changing the default from kdtree to balltree.
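
One way to avoid hardcoding the tree type would be to choose it from the metric. The following is only a sketch under assumptions, not the fix that was eventually applied: `pick_tree_type` and `_valid_metrics` are hypothetical helpers, and the lookup relies on the `valid_metrics` member of scikit-learn's KDTree and BallTree, which appears to be a plain attribute in some scikit-learn versions and a method in others (the sketch handles both).

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree


def _valid_metrics(tree_cls):
    """Return the metric names a tree class accepts.

    Hypothetical helper: handles valid_metrics being either a list-like
    attribute or a callable, depending on the scikit-learn version.
    """
    vm = tree_cls.valid_metrics
    return vm() if callable(vm) else vm


def pick_tree_type(metric):
    """Prefer a KD-tree when the metric allows it, else fall back to a ball tree.

    Hypothetical helper, not part of hdbscan.
    """
    if metric in _valid_metrics(KDTree):
        return "kdtree"
    if metric in _valid_metrics(BallTree):
        return "balltree"
    raise ValueError(f"metric {metric!r} is supported by neither KDTree nor BallTree")


# A few [lat, lon] points, converted to radians as haversine requires.
coords_rad = np.radians([[48.85, 2.35], [51.51, -0.13], [40.71, -74.01]])

tree_classes = {"kdtree": KDTree, "balltree": BallTree}
tree = tree_classes[pick_tree_type("haversine")](coords_rad, metric="haversine")

# Nearest two neighbours of each point; the first neighbour is the point itself.
dist, idx = tree.query(coords_rad, k=2)
```

Keeping the KD-tree as the first choice leaves behaviour unchanged for metrics it already handles, such as euclidean, so only metrics like haversine would take the ball-tree path.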

lmcinnes commented 7 years ago

Thanks for this, it's definitely a bug. The prediction apparatus is still new, so this is exactly the sort of thing I need to hammer out. I'll try to get this corrected for you shortly.

lmcinnes commented 7 years ago

Should be fixed now. Sorry about that!