Open jolespin opened 4 months ago
This seems like a nice idea. WDYT @jjerphan , @Micky774 , @jeremiedbb ?
Indeed, this might be a good contribution to introduce as a strategy after #26983 gets in.
I do not think creating a structure like skbio.stats.distance.DistanceMatrix
would be helpful, at least from a memory footprint perspective as it seems it stores an entire dense matrix.
We could have a vector-form distance such as the one returned by scipy.spatial.distance.squareform
(i.e. which would store $\frac{n (n - 1)}{2}$ elements using a flat 1-D arrays) which would have an interface allowing accessing the distance between $x_i$ and $x_i$, $\forall (i,j) \in [0 .. n-1]^2$.
Edit: I actually did not understand your request at first. I just have changed the title of this issue so that it is more consistent. Ideally, we should not materialize this square form representation of the distance matrix or any other ones but have the algorithm support the boolean distance metrics (like Jaccard's) directly.
Out of curiosity, is there something blocking you from using:
X = # Data
clusterer = HDBSCAN(metric="jaccard")
clusterer.fit(X)
?
@jjerphan at the time of writing this I was benchmarking HDBSCAN
by comparing to some "gold-standard" labels for my dataset which includes >50k observations. It just seemed like a waste to recompute the distances each time so I wanted to use the precompute option and then I realized that it was using the redundant square form. When I took a deep dive into the other classes on sklearn, it looked like most required the precomputed distance to be in the redundant square form.
Is the square form the redundant form or the 1d non-redundant form? I thought it was the former but the change in title makes me think otherwise.
I meant vector-form and adapted the title accordingly again.
I was confused by scipy.spatial.distance.squareform
which converts the one in the other and vice versa.
This sounds worth doing, however I agree that perhaps the best way forwards is through #26983, and maybe a buffer wrapper so that we can have e.g. FullDistance
to wrap the vector-form distance, allowing existing infrastructure to treat FullDistance
as a redundant squareform matrix without materializing extra memory. This would come at the expense of some performance due to indirection, but IMO a slightly lower-performance, lower-memory option here sounds like a good addition. I'm curious what folks think.
Would it make sense to force full distance matrices (ie redundant and square) into non-redundant vector form? To lower the memory footprint overall or to allow full distances if provided to increase performance at the cost of memory?
I don't think we'd want to force the vector form. Most usecases don't mind the memory footprint.
Also, most use cases do not have a symmetric distance matrix as well as generally X_train != X_test
Describe the workflow you want to enable
I would like to calculate the upper triangle of a distance from
scipy.spatial.distance.pdist
instead of the redundant and memory intensive version fromsklearn.metrics.pairwise_distances
which has $0.5 * (N^2 - N)$ values and use this as input tometric="precompute"
Describe your proposed solution
In addition to the original functionality (or ideally replacing):
Describe alternatives you've considered, if relevant
Using the redundant square form but this requires a lot more memory that isn't necessary.
Additional context
It may be worthwhile creating a
DistanceMatrix
object like https://scikit.bio/docs/latest/generated/skbio.stats.distance.DistanceMatrix.html#skbio.stats.distance.DistanceMatrix