scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Soft Clustering with precomputed distance matrix #128

Open IlyaOrson opened 7 years ago

IlyaOrson commented 7 years ago

Hello! First of all, thanks a lot for this clustering method and the implementation, both are super cool!

I am trying to use Soft Clustering with the precomputed distance matrix since I am using an unconventional distance. There appears to be no method implemented for this right now. I understand this is a new experimental feature and wondered if this limitation is just temporary. Is it possible to add this functionality?

Just for reference, the following code, built from the manual, produces this warning:

import hdbscan
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

blobs, labels = make_blobs(n_samples=2000, n_features=10)
pd.DataFrame(blobs).head()

# Cluster on a precomputed distance matrix and request prediction data
distance_matrix = pairwise_distances(blobs)
clusterer = hdbscan.HDBSCAN(metric='precomputed',
                            prediction_data=True)
clusterer.fit(distance_matrix)
clusterer.labels_

UserWarning: Cannot generate prediction data for non-vectorspace inputs -- access to the source data ratherthan mere distances is required!

lmcinnes commented 7 years ago

I believe it is a relatively fundamental obstruction at the present time. There may be some cases where it could be made to work, but I would have to think carefully about how best to build an API that would allow for that without being confusing for all the other cases. Sorry that I can't provide any better answers at this time.
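
One way to read the warning: a square distance matrix only records distances among the training points, so nothing in it says how far a new point is from any of them. If you can compute those distances yourself with your own metric, a crude stand-in is to copy the label of the nearest training point. The sketch below does only that. It is not hdbscan's approximate_predict, it produces no membership vectors, and the function name and toy data are hypothetical.

```python
def assign_by_nearest_neighbor(new_to_train, train_labels):
    """For each new point, copy the label of its closest training point.

    new_to_train: one row per new point, one distance per training point.
    train_labels: labels from a fitted clusterer (-1 means noise).
    """
    assigned = []
    for row in new_to_train:
        if len(row) != len(train_labels):
            raise ValueError("each row needs one distance per training point")
        # Index of the smallest distance in this row
        nearest = min(range(len(row)), key=row.__getitem__)
        assigned.append(train_labels[nearest])
    return assigned


# Toy data (made up): 4 training points with labels, 2 new points.
train_labels = [0, 0, 1, -1]
new_to_train = [
    [0.2, 0.4, 3.0, 5.0],  # closest to training point 0
    [4.0, 3.5, 0.3, 2.0],  # closest to training point 2
]
predicted = assign_by_nearest_neighbor(new_to_train, train_labels)
print(predicted)
```

Note that a new point whose nearest neighbour is noise inherits the noise label, which may or may not be what you want.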

IlyaOrson commented 7 years ago

No rush at all, I will stay tuned. Thanks for this again!

nlassaux commented 7 years ago

Hi! Thank you for all your work!

Is it the same with callables? Because I tried to execute the following code:

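import hdbscan
# Assumption: this is the PyPI "vincenty" package, whose function accepts miles=True
from vincenty import vincenty

def userdist(x, y):
    # Distance in miles between two (latitude, longitude) points
    distance = vincenty((x[0], x[1]), (y[0], y[1]), miles=True)
    return distance

clusterer = hdbscan.HDBSCAN(min_cluster_size=6,
                            min_samples=3,
                            metric=userdist,
                            prediction_data=True).fit(data[['latitude', 'longitude']])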

I don't have any warning, but when I call all_points_membership_vectors(clusterer) on it, I notice that clusterer.prediction_data_ is None.

The error I have is the following:

/Users/nlassaux/hdbscan-clustering/env/lib/python2.7/site-packages/hdbscan/prediction.pyc in all_points_membership_vectors(clusterer)
    514     clusters = np.array(list(clusterer.condensed_tree_._select_clusters()
    515                              )).astype(np.intp)
--> 516     all_points = clusterer.prediction_data_.raw_data
    517 
    518     distance_vecs = all_points_dist_membership_vector(all_points,
AttributeError: 'NoneType' object has no attribute 'raw_data'

Can you explain why a custom metric is a special case for getting a soft clustering?
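
Until this is supported, the AttributeError above can at least be avoided by checking for prediction data before calling the soft clustering functions. The following is a minimal sketch: has_prediction_data is a hypothetical helper, not part of hdbscan, and the SimpleNamespace objects only stand in for fitted clusterers so the example runs without the library.

```python
from types import SimpleNamespace


def has_prediction_data(clusterer):
    """Return True if the fitted clusterer carries usable prediction data.

    getattr with a default also covers objects where the attribute is
    missing or raises AttributeError on access.
    """
    return getattr(clusterer, "prediction_data_", None) is not None


# Stand-ins for fitted clusterers (hypothetical, for illustration only)
without_data = SimpleNamespace(prediction_data_=None)
with_data = SimpleNamespace(prediction_data_=object())

print(has_prediction_data(without_data))
print(has_prediction_data(with_data))
```

With a real clusterer, the check would guard the call to all_points_membership_vectors(clusterer) so that a missing prediction_data_ is reported clearly instead of surfacing as a 'NoneType' error.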

lmcinnes commented 7 years ago

The soft clustering is still fairly new, and I haven't pushed everything through properly. For now I'm making heavy use of sklearn's KDTree and BallTree, and while they support custom metrics, callables aren't explicitly listed among the allowed metrics, and checking that list is the easiest way to decide whether the trees can reasonably be used. That means the algorithm falls back to other approaches, which don't support soft clustering at this time.
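
To see in advance which metrics the trees list, you can inspect scikit-learn's public VALID_METRICS dictionary. This is only a sketch of the kind of check described above: the helper name tree_supported is hypothetical, and hdbscan's internal check may differ in detail.

```python
from sklearn.neighbors import VALID_METRICS


def tree_supported(metric):
    """Return True if the metric is listed for KDTree or BallTree."""
    return (metric in VALID_METRICS["kd_tree"]
            or metric in VALID_METRICS["ball_tree"])


def userdist(x, y):
    # Placeholder callable metric, for illustration only
    return abs(x[0] - y[0]) + abs(x[1] - y[1])


for metric in ("euclidean", "manhattan", "sqeuclidean", "cosine", userdist):
    print(metric, tree_supported(metric))
```

Metrics that are not listed, including callables, cosine and sqeuclidean, would take the fallback path, which may be why the same warning or a missing prediction_data_ shows up for them elsewhere in this thread.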

lmcinnes commented 7 years ago

If you could add an issue with a feature request to ensure that callable metrics are supported for soft clustering I would appreciate it -- it will help stop this falling through the cracks later.

elena-sharova commented 6 years ago

Hello,

Thank you for developing such a great clustering library.

It would be really useful to have this feature available in the next release of hdbscan.

It would be great to be able to use approximate_predict or membership_vector either with a custom distance measure or with a precomputed pairwise distance matrix as input.

Could I ask if there are any plans for this functionality to be added?

Thank you, Elena

lmcinnes commented 6 years ago

My current priorities are in developing a follow-on clustering library that benefits from some newer theory and a lot of lessons learned from this library, particularly when it comes to soft clustering. In practice, that means I do not have any near-term plans to add such functionality myself. I would be more than happy to accept pull requests that add it.

danielgeiszler commented 4 years ago

Hi @lmcinnes ,

Has there been any progress on getting prediction working for precomputed distances?

warrior-galaxy commented 4 years ago

Hello,

Any progress on getting prediction working with precomputed distances? I precomputed cosine distances, since cosine is not a supported metric, but prediction does not work on the resulting matrix.
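
A workaround that is not confirmed anywhere in this thread, offered only as a sketch: for vectors scaled to unit length, the squared Euclidean distance equals twice the cosine distance, so fitting on L2-normalized vectors with metric='euclidean' keeps the same neighbour ordering while staying on a vector-space input. The clusters need not match a cosine-based fit exactly, because HDBSCAN uses the distance values and not just their ordering. The code below only checks the identity on made-up vectors; all helper names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up vectors for illustration
a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)

lhs = euclidean_distance(ua, ub) ** 2  # squared Euclidean on unit vectors
rhs = 2.0 * cosine_distance(a, b)      # twice the cosine distance
print(math.isclose(lhs, rhs))
```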

lmcinnes commented 4 years ago

It is unlikely to be available for precomputed distances any time soon. Sorry.

kr-hansen commented 2 years ago

I get this same error

UserWarning: Cannot generate prediction data for non-vectorspace inputs -- access to the source data ratherthan mere distances is required!

when using sqeuclidean as my distance metric. Is that to be expected @lmcinnes? I'm guessing under the hood any of the scipy distances are just doing the same thing and calculating a pre-computed metric?

MH8775 commented 2 years ago

Checking again on the status (hopefully progress) of this thread, namely using fuzzy/soft clustering with a precomputed distance matrix...