scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Prediction Data Generation Fails w/ a Warning #572

Open OMirzaei opened 2 years ago

OMirzaei commented 2 years ago

Hello,

I have a distance matrix (attached as dist_matrix.npy.zip) that I computed with a custom distance function:

from scipy.spatial.distance import pdist, squareform

# Condensed pairwise distances from the custom function, expanded to a square matrix
dist_condensed = pdist(X, metric=lambda u, v: calc_distance(u[0], v[0]))
dist_matrix = squareform(dist_condensed)

Then, I fitted a model using the attached distance matrix as follows:

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=2, metric='precomputed', prediction_data=True)
clusterer.fit(dist_matrix)

After running the above, I got the following warning (which looks important if you want to predict on new data later):

hdbscan/hdbscan_.py:1256: UserWarning: Cannot generate prediction data for non-vectorspace inputs -- access to the source data rather than mere distances is required!

I'd also like to predict the cluster of new data points later by passing a distance matrix. Does the approximate_predict method accept a distance matrix (given that I fitted on a custom distance matrix originally)? Based on the documentation, I believe it does not.
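For readers facing the same question, one possible workaround when only distances are available is to give each new point the label of its nearest clustered training point. This is plain nearest-neighbor label propagation, not what approximate_predict does, and it is not part of hdbscan's API. The function name, its parameters, and the optional max_distance cutoff below are hypothetical; the only thing taken from hdbscan is the convention that the label -1 marks noise. A minimal sketch under those assumptions:

```python
import numpy as np


def assign_by_nearest_clustered_neighbor(dist_new_to_train, train_labels, max_distance=None):
    """Label new points from precomputed distances (hypothetical helper).

    dist_new_to_train: array of shape (n_new, n_train), distances from each
        new point to each training point, computed with the same custom
        distance that produced the training matrix.
    train_labels: array of shape (n_train,), e.g. clusterer.labels_,
        where -1 marks noise.
    max_distance: optional cutoff; new points farther than this from every
        clustered training point are labeled -1.
    """
    dist = np.asarray(dist_new_to_train, dtype=float)
    labels = np.asarray(train_labels)
    if dist.ndim != 2 or dist.shape[1] != labels.shape[0]:
        raise ValueError("dist_new_to_train must have shape (n_new, n_train)")

    clustered = labels != -1
    if not clustered.any():
        # No training point belongs to a cluster, so everything is noise.
        return np.full(dist.shape[0], -1, dtype=int)

    # Ignore noise points by pushing their distances to infinity.
    masked = np.where(clustered, dist, np.inf)
    nearest = masked.argmin(axis=1)
    assigned = labels[nearest].astype(int)

    if max_distance is not None:
        nearest_dist = masked[np.arange(dist.shape[0]), nearest]
        assigned[nearest_dist > max_distance] = -1
    return assigned


if __name__ == "__main__":
    # Toy data: four training points (the last one is noise), two new points.
    toy_labels = [0, 0, 1, -1]
    toy_dist = [[0.1, 0.2, 5.0, 0.01],
                [4.0, 4.5, 0.3, 0.05]]
    print(assign_by_nearest_clustered_neighbor(toy_dist, toy_labels))
```

The rectangular new-to-training distance block can be built with scipy.spatial.distance.cdist, which accepts a callable metric in the same way pdist does. Unlike approximate_predict, this sketch returns hard labels only and gives no membership strengths.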

I also tried the provided example (see here) to check whether it works, but I got the same warning (see below).

[Screenshot, 2022-10-27: the same warning]

I would appreciate help understanding why I get that warning in the first place and how I can use approximate_predict to predict the cluster of new data points in the future.

summer-tt commented 1 year ago

It seems this occurs whenever "precomputed" or a callable is used as the metric parameter.

I think not supporting "precomputed" is reasonable, but a callable metric should be supported.

amunozj commented 3 weeks ago

Any news on this? It would be amazing to do soft clustering with a callable metric.