scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.77k stars 496 forks source link

Is it possible to use HDBSCAN with the Levenshtein distance? #464

Open lesshaste opened 3 years ago

lesshaste commented 3 years ago

Is it possible to use HDBSCAN with the Levenshtein distance? My dataset is too large to make a full distance matrix to feed into it.

The Levenshtein distance satisfies the triangle inequality which I understand is one requirement. Is this possible?

lmcinnes commented 3 years ago

In the current implementation it needs to be a distance supported by sklearn's BallTree or KDTree structures, which Levenshtein is not. You can, however, use sparse precomputed distance matrices (using the scipy.sparse format). Thus you could compute distances to k nearest neighbors (for a sufficiently large value of k) and use that.

lesshaste commented 3 years ago

How could you work out what the k nearest neighbors are?

lmcinnes commented 3 years ago

I think you would want to use an approximate nearest neighbor library than supports levenshtein distance. I believe hnsw in nmslib meets those criteria.