scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

dist_metrics error with default settings #139

Open architec997 opened 6 years ago

architec997 commented 6 years ago

Producing a simple dataframe via

import numpy as np
import pandas as pd
import hdbscan

x = np.linspace(0, 100, 200)
y = np.arange(0, 200)
xy, _ = np.meshgrid(x, y)
noise = 0.3 * np.random.random((200, 200))
series = np.sin(xy + 5 * noise) + noise
series[0, :] += 10 * np.random.random(200)
data = pd.DataFrame(series)

I try to run HDBSCAN clustering with the default arguments

clusterer = hdbscan.HDBSCAN().fit(data)

And get the following error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-58-31e012db38f8> in <module>()
      1 from sklearn.cluster import DBSCAN
----> 2 clusterer = hdbscan.HDBSCAN().fit(data)

~/anaconda3/envs/py36/lib/python3.6/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    814          self._condensed_tree,
    815          self._single_linkage_tree,
--> 816          self._min_spanning_tree) = hdbscan(X, **kwargs)
    817 
    818         if self.prediction_data:

~/anaconda3/envs/py36/lib/python3.6/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    534                     _hdbscan_prims_kdtree)(X, min_samples, alpha,
    535                                            metric, p, leaf_size,
--> 536                                            gen_min_span_tree, **kwargs)
    537             else:
    538                 (single_linkage_tree, result_min_span_tree) = memory.cache(

~/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py in __call__(self, *args, **kwargs)
    360 
    361     def __call__(self, *args, **kwargs):
--> 362         return self.func(*args, **kwargs)
    363 
    364     def call_and_shelve(self, *args, **kwargs):

~/anaconda3/envs/py36/lib/python3.6/site-packages/hdbscan/hdbscan_.py in _hdbscan_prims_kdtree(X, min_samples, alpha, metric, p, leaf_size, gen_min_span_tree, **kwargs)
    168 
    169     # TO DO: Deal with p for minkowski appropriately
--> 170     dist_metric = DistanceMetric.get_metric(metric, **kwargs)
    171 
    172     # Get distance to kth nearest neighbour

TypeError: descriptor 'get_metric' requires a 'hdbscan.dist_metrics.DistanceMetric' object but received a 'str'

I tried explicitly specifying other metrics with metric='manhattan' etc. as an argument; it did not help.
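
A minimal sketch of the precomputed-distance route tried later in this thread (see farfan92's comment below), assuming the failure is confined to the metric-name lookup in DistanceMetric.get_metric; the pairwise_distances call and the toy X are illustrative, not from the original report:

# Sketch only: bypass the string-based metric lookup by precomputing distances.
import hdbscan
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.random((200, 5))                 # stand-in for the real data
D = pairwise_distances(X, metric='manhattan')  # dense (n, n) distance matrix

clusterer = hdbscan.HDBSCAN(metric='precomputed').fit(D)
print(clusterer.labels_)

Note that farfan92 reports below that this route still hit a NameError with the conda-forge build, so treat it as a diagnostic rather than a fix.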

lmcinnes commented 6 years ago

I suspect this is an arg order issue in the code somewhere, possibly due to additions. This is a little disconcerting. Let me see if I can track this down later today.

lmcinnes commented 6 years ago

Sorry, I ran out of time today. I'll have to try and get to this a little later. My apologies for the delay.

farfan92 commented 6 years ago

Also getting "TypeError: descriptor 'get_metric' requires a 'hdbscan.dist_metrics.DistanceMetric' object but received a 'str'", even when just using the simple case in the documentation.

farfan92 commented 6 years ago

Error occurs with RobustSingleLinkage as well.

To avoid the get_metric method receiving the string 'euclidean' or 'manhattan' etc. instead of the expected object, I switched to a precomputed distance matrix. Now I'm getting:

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=None, metric='precomputed').fit(gower_df)

NameError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=None, metric='precomputed').fit(D)

C:\Users\centec7\AppData\Local\Continuum\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py in fit(self, X, y)
    814          self._condensed_tree,
    815          self._single_linkage_tree,
--> 816          self._min_spanning_tree) = hdbscan(X, **kwargs)
    817
    818         if self.prediction_data:

C:\Users\centec7\AppData\Local\Continuum\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    526                     _hdbscan_generic)(X, min_samples,
    527                                       alpha, metric, p, leaf_size,
--> 528                                       gen_min_span_tree, **kwargs)
    529         elif metric in KDTree.valid_metrics:
    530             # TO DO: Need heuristic to decide when to go to boruvka;

C:\Users\centec7\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py in __call__(self, *args, **kwargs)
    281         return _load_output(self._output_dir, _get_func_fullname(self.func),
    282                             timestamp=self.timestamp,
--> 283                             metadata=self.metadata, mmap_mode=self.mmap_mode,
    284                             verbose=self.verbose)
    285

C:\Users\centec7\AppData\Local\Continuum\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py in _hdbscan_generic(X, min_samples, alpha, metric, p, leaf_size, gen_min_span_tree, **kwargs)
     85                                              min_samples, alpha)
     86
---> 87     min_spanning_tree = mst_linkage_core(mutual_reachability_)
     88
     89     # mst_linkage_core does not generate a full minimal spanning tree

hdbscan/_hdbscan_linkage.pyx in hdbscan._hdbscan_linkage.mst_linkage_core (hdbscan\_hdbscan_linkage.c:2894)()

hdbscan/_hdbscan_linkage.pyx in hdbscan._hdbscan_linkage.mst_linkage_core (hdbscan\_hdbscan_linkage.c:2281)()

NameError: name 'np' is not defined

lmcinnes commented 6 years ago

Sorry, I'm having trouble reproducing this. Can you tell me a little more about your setup?

architec997 commented 6 years ago

I also checked - I have the same error using a precomputed distance matrix as farfan92.

Ubuntu 17.10, Anaconda 5.0.1, Python 3.6. The packages installed in the venv I'm using:

# packages in environment at /home/vladimir/anaconda3/envs/py36:

farfan92 commented 6 years ago

Updating packages (numpy and sklearn specifically) seems to have removed the NameError. It must have been a compatibility issue introduced after installing some other packages.

lmcinnes commented 6 years ago

I'm glad at least one of you got this resolved. Hopefully refreshing/updating packages will work a second time? I am honestly at a little bit of a loss here.

danielhelf commented 6 years ago

Getting the exact same error message here (descriptor 'get_metric' requires a 'hdbscan.dist_metrics.DistanceMetric' object but received a 'str') despite updating the packages.

Vanwalleghem commented 6 years ago

Got the same error message on an Ubuntu virtual machine with Python 2.7 and a Windows PC with Python 3.6.4, both running the latest version of Anaconda with hdbscan installed through conda-forge. I may try installing it another way tomorrow.

Vanwalleghem commented 6 years ago

Alright, I actually had some time, so I tested that. On the same machine, installing hdbscan via pip worked immediately (after I removed the conda-forge version). Hope that helps you narrow it down and/or fix it for others.

linwoodc3 commented 6 years ago

I also had this error, but it was only present in the conda-forge-installed version of hdbscan, not in the pip-installed version. I removed the conda-forge version, ran pip install hdbscan in my conda environment, and hdbscan works fine.
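
A quick sanity check after switching installs, assuming the problem lives in the compiled dist_metrics extension shipped with the broken build: the call below is the same lookup the tracebacks above show failing, so on a healthy install it should return a DistanceMetric object instead of raising the TypeError.

# Sanity-check sketch: exercise the metric lookup that fails in the tracebacks above.
from hdbscan.dist_metrics import DistanceMetric

print(DistanceMetric.get_metric('euclidean'))  # a DistanceMetric instance, not a TypeError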

lmcinnes commented 6 years ago

@linwoodc3 That's a little weird; the conda-forge version gets synced with the pip version regularly. Perhaps a conda upgrade hdbscan would have done the job? Regardless, you have a working version now, and that's what counts. Thanks for the report; I'll keep an eye out for something amiss like this somewhere along the line.

kevinafra commented 5 years ago

I just got this same error (descriptor 'get_metric' requires a 'hdbscan.dist_metrics.DistanceMetric' object but received a 'str'). I installed hdbscan just yesterday via pip.

I did notice that when I tried to import it, it gave me an error about 'numpy.core.multiarray failed to import' but no reason why. So I imported numpy.core.multiarray manually, and then I was able to import hdbscan. I don't know whether that is a related problem. But attempting to fit some data that I had just fit with sklearn.cluster.DBSCAN failed with the above error when I tried it with hdbscan.

I have Python 2.7.13 and numpy 1.11.2. 'pip check' doesn't find any broken dependencies. What else can I try? I would really like to use hdbscan, as I have data whose clusters are certain to have variable density. Does hdbscan require Python 3.x perhaps, along with all of the dependent versions of numpy, Cython, etc.?
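
A small diagnostic sketch based on the report above, assuming the 'numpy.core.multiarray failed to import' message points to a numpy binary mismatch with hdbscan's compiled extensions; the version numbers in the comments are the reporter's, not requirements:

# Diagnostic sketch: mirror the import order described above, then retry a tiny fit.
import numpy
print(numpy.__version__)          # 1.11.2 was reported; a newer numpy may help

import numpy.core.multiarray      # the import named in the original error
import hdbscan                    # per the report, imports once the line above runs

# Re-running a small fit after upgrading numpy and reinstalling hdbscan (so its
# C extensions rebuild against the new numpy) should show whether the TypeError persists.
data = numpy.random.random((50, 3))
print(hdbscan.HDBSCAN(min_cluster_size=5).fit(data).labels_)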