scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
59.55k stars 25.3k forks source link

FEAT Allow the vector-form representation of symetric distance matrices as input #29133

Open jolespin opened 4 months ago

jolespin commented 4 months ago

Describe the workflow you want to enable

I would like to calculate the upper triangle of a distance from scipy.spatial.distance.pdist instead of the redundant and memory intensive version from sklearn.metrics.pairwise_distances which has $0.5 * (N^2 - N)$ values and use this as input to metric="precompute"

Describe your proposed solution

from scipy.spatial.distance import pdist
X = # Data
dist = pdist(X, metric="jaccard") # Or any other but I just jaccard a lot for my boolean datasets.  Just didn't want to use euclidean/cosine 
clusterer = HDBSCAN(metric="precomputed_triangle")
clusterer.fit(dist)

In addition to the original functionality (or ideally replacing):

from scipy.spatial.distance import pdist, squareform
X = # Data
dist = squareform(pdist(X, metric="jaccard")) # Or any other but I just jaccard a lot for my boolean datasets.  Just didn't want to use euclidean/cosine 
clusterer = HDBSCAN(metric="precomputed")
clusterer.fit(dist)

Describe alternatives you've considered, if relevant

Using the redundant square form but this requires a lot more memory that isn't necessary.

Additional context

It may be worthwhile creating a DistanceMatrix object like https://scikit.bio/docs/latest/generated/skbio.stats.distance.DistanceMatrix.html#skbio.stats.distance.DistanceMatrix

adrinjalali commented 3 months ago

This seems like a nice idea. WDYT @jjerphan , @Micky774 , @jeremiedbb ?

jjerphan commented 3 months ago

Indeed, this might be a good contribution to introduce as a strategy after #26983 gets in.

I do not think creating a structure like skbio.stats.distance.DistanceMatrix would be helpful, at least from a memory footprint perspective as it seems it stores an entire dense matrix.

We could have a vector-form distance such as the one returned by scipy.spatial.distance.squareform (i.e. which would store $\frac{n (n - 1)}{2}$ elements using a flat 1-D arrays) which would have an interface allowing accessing the distance between $x_i$ and $x_i$, $\forall (i,j) \in [0 .. n-1]^2$.


Edit: I actually did not understand your request at first. I just have changed the title of this issue so that it is more consistent. Ideally, we should not materialize this square form representation of the distance matrix or any other ones but have the algorithm support the boolean distance metrics (like Jaccard's) directly.

Out of curiosity, is there something blocking you from using:

X = # Data
clusterer = HDBSCAN(metric="jaccard")
clusterer.fit(X)

?

jolespin commented 3 months ago

@jjerphan at the time of writing this I was benchmarking HDBSCAN by comparing to some "gold-standard" labels for my dataset which includes >50k observations. It just seemed like a waste to recompute the distances each time so I wanted to use the precompute option and then I realized that it was using the redundant square form. When I took a deep dive into the other classes on sklearn, it looked like most required the precomputed distance to be in the redundant square form.

Is the square form the redundant form or the 1d non-redundant form? I thought it was the former but the change in title makes me think otherwise.

jjerphan commented 3 months ago

I meant vector-form and adapted the title accordingly again.

I was confused by scipy.spatial.distance.squareform which converts the one in the other and vice versa.

Micky774 commented 3 months ago

This sounds worth doing, however I agree that perhaps the best way forwards is through #26983, and maybe a buffer wrapper so that we can have e.g. FullDistance to wrap the vector-form distance, allowing existing infrastructure to treat FullDistance as a redundant squareform matrix without materializing extra memory. This would come at the expense of some performance due to indirection, but IMO a slightly lower-performance, lower-memory option here sounds like a good addition. I'm curious what folks think.

jolespin commented 3 months ago

Would it make sense to force full distance matrices (ie redundant and square) into non-redundant vector form? To lower the memory footprint overall or to allow full distances if provided to increase performance at the cost of memory?

adrinjalali commented 3 months ago

I don't think we'd want to force the vector form. Most usecases don't mind the memory footprint.

jjerphan commented 3 months ago

Also, most use cases do not have a symmetric distance matrix as well as generally X_train != X_test