scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

all_points_membership_vectors and membership_vectors return nan probabilities #295

Closed alberto-sibner closed 5 years ago

alberto-sibner commented 5 years ago

Hi,

I'm using hdbscan with the haversine metric to find clusters based on latitudes and longitudes. The algorithm works really well for me. However, when I use all_points_membership_vectors and membership_vectors with some coordinates, these methods return nan probabilities.

In other words, I have N points in my dataset and most of them are classified quite well using soft clustering. For a small subset of these N points, however, I get nan probabilities of belonging to any of the clusters.

I have checked some of these points individually and they seem totally normal to me: they are surrounded by a lot of clusters on the map, and yet they have no probability of belonging to any of them.
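To make the report above concrete, here is a minimal sketch of how the affected points could be located. The helper name `rows_with_nan` and the small `demo` array are hypothetical; the array only stands in for the matrix that the soft clustering call would return.

```python
import numpy as np


def rows_with_nan(membership):
    """Return the indices of rows that contain at least one NaN."""
    membership = np.asarray(membership, dtype=float)
    return np.flatnonzero(np.isnan(membership).any(axis=1))


# Hypothetical stand-in for a (n_points, n_clusters) membership matrix.
demo = np.array([
    [0.70, 0.20, 0.10],
    [np.nan, np.nan, np.nan],  # a point with nan probabilities
    [0.10, 0.80, 0.05],
])

nan_rows = rows_with_nan(demo)
print(nan_rows)
```

With a real model, the input would presumably be `hdbscan.all_points_membership_vectors(clusterer)`, where `clusterer` was fitted with `prediction_data=True` and, for the haversine metric, on coordinates given in radians. The returned indices can then be used to look up the offending latitude/longitude pairs.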

alberto-sibner commented 5 years ago

Any ideas about the situations in which this might happen?

Thank you very much

EDIT:

As an additional note, during the training stage the following warning is raised:

\venv\lib\site-packages\hdbscan\prediction.py:547: RuntimeWarning: invalid value encountered in double_scalars
  clusterer.prediction_data_.cluster_tree)
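One way to find out which call produces such a warning is to escalate `RuntimeWarning` to an exception, so that a full traceback becomes available. The sketch below uses only the standard `warnings` module and NumPy; `run_strict` and `zero_over_zero` are hypothetical names, and the 0/0 division merely imitates the kind of operation that emits an "invalid value" warning.

```python
import warnings

import numpy as np


def run_strict(func, *args, **kwargs):
    """Call func, turning any RuntimeWarning into a raised exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", RuntimeWarning)
        return func(*args, **kwargs)


def zero_over_zero():
    # 0/0 is an invalid floating point operation and yields nan.
    return np.array([0.0]) / np.array([0.0])


try:
    run_strict(zero_over_zero)
    raised = False
    message = ""
except RuntimeWarning as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

In the situation described here, the wrapped call would be whichever hdbscan call emits the warning, for example `run_strict(hdbscan.all_points_membership_vectors, clusterer)`. That this pinpoints the root cause is an assumption, since the invalid value may originate in compiled code.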

I could also share my array of latitudes and longitudes if necessary.

EDIT 2:

I also noticed that there are points whose probability vectors sum to 0 (i.e. they do not belong to any cluster either) when they shouldn't.
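The zero-sum rows can be separated from the nan rows with a check like the following. This is only a sketch: `rows_summing_to_zero`, the tolerance `tol`, and the `demo` matrix are all hypothetical.

```python
import numpy as np


def rows_summing_to_zero(membership, tol=1e-12):
    """Return indices of rows whose probabilities sum to (almost) zero.

    Rows containing NaN are not reported, because np.isclose treats
    NaN as unequal to 0 by default.
    """
    membership = np.asarray(membership, dtype=float)
    row_sums = membership.sum(axis=1)
    return np.flatnonzero(np.isclose(row_sums, 0.0, rtol=0.0, atol=tol))


# Hypothetical stand-in for a (n_points, n_clusters) membership matrix.
demo = np.array([
    [0.6, 0.4],
    [0.0, 0.0],        # sums to zero: assigned to no cluster
    [np.nan, np.nan],  # nan row, reported by a separate check
    [0.0, 1.0],
])

zero_rows = rows_summing_to_zero(demo)
print(zero_rows)
```

Keeping the two checks apart makes it easier to see whether the nan rows and the zero-sum rows correspond to the same input points.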

alberto-sibner commented 5 years ago

Hi @lmcinnes

Any ideas about why this might be happening?

Thanks in advance

lmcinnes commented 5 years ago

At this point I must admit that I am not quite sure why the soft cluster membership functions fail. It has been a while since I wrote them, and there seem to be odd corner cases that trigger this behaviour, but I can rarely reproduce it. So while a few things have been fixed, niggling issues remain for which I don't really have any ideas.

alberto-sibner commented 5 years ago

@lmcinnes Oh, I see what you mean... Fair enough! Do you think I could help you fix this issue? Soft clustering is very important for the problems I need to solve, and hdbscan is the algorithm that works best for me after having tried many others.

I have the exact parameters that trigger this behaviour for the dataset I use (and also a trained model if you want). Could I share them with you privately and help you debug this?

Thank you very much!

alberto-sibner commented 5 years ago

I've managed to solve it. In my case, it seems the problem was caused by a few duplicated points in the long lists of coordinates I was clustering.
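For anyone hitting the same problem, a minimal sketch of removing exact duplicates before fitting might look like this. The function name `drop_duplicate_points` and the sample coordinates are hypothetical, and whether dropping duplicates is acceptable for a given analysis is an assumption to check.

```python
import numpy as np


def drop_duplicate_points(coords):
    """Remove exact duplicate rows, keeping the first occurrence of each.

    Returns the de-duplicated array and the indices of the kept rows,
    both in the original order.
    """
    coords = np.asarray(coords, dtype=float)
    _, first_idx = np.unique(coords, axis=0, return_index=True)
    keep = np.sort(first_idx)
    return coords[keep], keep


# Hypothetical (latitude, longitude) pairs in degrees.
coords_deg = np.array([
    [40.4168, -3.7038],
    [41.3874, 2.1686],
    [40.4168, -3.7038],  # exact duplicate of the first row
    [37.3891, -5.9845],
])

unique_deg, kept = drop_duplicate_points(coords_deg)
unique_rad = np.radians(unique_deg)  # the haversine metric expects radians
print(kept, unique_rad.shape)
```

The `kept` indices make it possible to map cluster labels and membership vectors back to the original rows. Near-duplicates (points that differ only in the last decimals) are not removed by this approach.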