lesshaste opened this issue 3 years ago (status: Open)
In the current implementation the metric needs to be one supported by sklearn's BallTree or KDTree structures, which Levenshtein is not. You can, however, pass a sparse precomputed distance matrix (in scipy.sparse format). So you could compute the distances from each point to its k nearest neighbors (for a sufficiently large value of k) and use that.
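A minimal sketch of what such a sparse k-nearest-neighbor distance matrix might look like. The helper names `levenshtein` and `knn_sparse_distances` are hypothetical, the neighbor search is brute force (so it only illustrates the matrix format and would not scale to a large dataset), and the strings are assumed to be distinct so that no stored distance is zero.

```python
import numpy as np
from scipy.sparse import csr_matrix


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def knn_sparse_distances(strings, k):
    """Brute-force k nearest neighbors, returned as a symmetric CSR matrix.

    Assumes the strings are distinct, so every stored distance is >= 1.
    """
    n = len(strings)
    rows, cols, vals = [], [], []
    for i in range(n):
        dists = [(levenshtein(strings[i], strings[j]), j)
                 for j in range(n) if j != i]
        for d, j in sorted(dists)[:k]:
            rows.append(i)
            cols.append(j)
            vals.append(float(d))
    m = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # Keep an edge if either endpoint lists the other as a neighbor.
    return m.maximum(m.T)


strings = ["kitten", "sitting", "mitten", "bitten", "flask", "flash"]
D = knn_sparse_distances(strings, k=2)
```

The resulting `D` could then be passed along the lines of `hdbscan.HDBSCAN(metric="precomputed").fit(D)`. For a dataset too large for brute force, the inner loop would be replaced by whatever approximate neighbor search is available, and k would need to be large enough for the neighbor graph to be usable.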
How could you work out what the k nearest neighbors are?
I think you would want to use an approximate nearest neighbor library that supports Levenshtein distance. I believe hnsw in nmslib meets those criteria.
Is it possible to use HDBSCAN with the Levenshtein distance? My dataset is too large to make a full distance matrix to feed into it.
The Levenshtein distance satisfies the triangle inequality which I understand is one requirement. Is this possible?