scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Runtime warning in outlier detection #129

Open IlyaOrson opened 7 years ago

IlyaOrson commented 7 years ago

Hello! I am getting a lot of warnings of the following type in the latest tagged version:

...\Continuum\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py:930:
RuntimeWarning: invalid value encountered in double_scalars 
self._outlier_scores = outlier_scores(self._condensed_tree)

Does anyone know why this warning appears and how to avoid the problem?

lmcinnes commented 7 years ago

Most likely there is an issue with NaNs creeping in somehow. This could be due to peculiarities of the dataset, particularly if you have more than min_cluster_size points that are all identical (although I believe many of those issues should be caught more elegantly now). Can you share the dataset?
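For anyone who cannot share their data, the condition described above can be checked locally. The following is a minimal sketch using only NumPy; the helper name `duplicate_groups` and the toy array are made up for illustration and are not part of hdbscan.

```python
import numpy as np

def duplicate_groups(X, min_cluster_size=5):
    """Return the distinct rows that occur more than min_cluster_size times.

    Hypothetical helper for diagnosis only; it is not an hdbscan function.
    """
    X = np.asarray(X)
    # np.unique with axis=0 treats each row as one item and counts repeats.
    rows, counts = np.unique(X, axis=0, return_counts=True)
    mask = counts > min_cluster_size
    return rows[mask], counts[mask]

# Toy data (assumed): six identical points, three identical points, one other.
X = np.array([[0.0, 0.0]] * 6 + [[5.0, 5.0]] * 3 + [[1.0, 2.0]])
rows, counts = duplicate_groups(X, min_cluster_size=5)
print(rows, counts)
```

If this returns any rows, the dataset contains groups of identical points large enough to be a plausible source of the NaNs.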

IlyaOrson commented 7 years ago

I can't share the data set, but you are right: NaNs appear because I have the default min_cluster_size = 5 and hdbscan identifies two clusters, one of which has just three members. In my particular case, all the values within each cluster are identical to one another.

lmcinnes commented 7 years ago

You can potentially alleviate such a problem by adding a very small amount of noise to your data (well below the level of data distribution, just enough to jiggle identical points apart).
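A minimal sketch of that suggestion, assuming the data is a numeric NumPy array. The helper name `add_jitter`, the `rel_scale` default, and the fallback for constant features are illustrative choices, not anything hdbscan provides or prescribes.

```python
import numpy as np

def add_jitter(X, rel_scale=1e-6, seed=0):
    """Return a copy of X with tiny Gaussian noise added to every value.

    Hypothetical helper: the noise is scaled to a small fraction of each
    feature's standard deviation so it stays far below the data's spread.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    spread = X.std(axis=0)
    # For constant features (std == 0), fall back to the feature's magnitude,
    # or 1.0 if the feature is all zeros, so the noise is never exactly zero.
    fallback = np.maximum(np.abs(X).max(axis=0), 1.0)
    spread = np.where(spread > 0, spread, fallback)
    noise = rng.normal(0.0, 1.0, size=X.shape) * (rel_scale * spread)
    return X + noise

# Toy data (assumed): two groups of exactly identical points.
X = np.array([[0.0, 0.0]] * 6 + [[5.0, 5.0]] * 3)
X_jittered = add_jitter(X)

# X_jittered can then be clustered instead of X, for example:
#   clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(X_jittered)
#   scores = clusterer.outlier_scores_
```

The right noise scale depends on the data; it should be large enough to separate identical points in floating point but small enough not to change which points are neighbours.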

mihahauke commented 7 years ago

@lmcinnes It worked for me.