lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Jaccard transformation function clarification #1015

Open Rridley7 opened 1 year ago

Rridley7 commented 1 year ago

I have a quick question about the Jaccard distance in this package versus the scipy.spatial version when the data is non-binary. The version in this package: https://github.com/lmcinnes/umap/blob/5c79fa60ce536405339da227bfd885635b68735d/umap/distances.py#L382

jaccard(np.array([1,5,0,1]),np.array([1,1.45,0,1]))
## 0.0

The version in scipy

scipy.spatial.distance.jaccard(np.array([1,5,0,1]),np.array([1,1.45,0,1]))
## 0.333

When running UMAP, which of these versions is used when computing distances with the jaccard metric?

As an aside, I noticed that UMAP runs much faster on a large matrix (~8M x 300) of non-binary data than on the same data first converted to binary indicators, e.g. with array.astype(bool).astype(int). This is what led me to check for a difference between the two functions.

lmcinnes commented 1 year ago

When running UMAP it will, for the most part, be either the one you cite first or this one from pynndescent. For small datasets (the cutoff is a somewhat arbitrary 4096 samples) you may get the scipy version, since in those cases UMAP just uses sklearn's pairwise_distances to compute the full distance matrix.

I'm not sure what is going on with the scipy version for the data you cite. You can provide a weight vector to do weighted Jaccard, but that's a third argument, so I can't see how you would get anything but a Jaccard distance of 0: the two vectors, despite having different values, share exactly the same non-zeros.
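
For anyone comparing the two numbers, here is a minimal NumPy sketch of two ways a "Jaccard" distance could be defined on non-binary input. The function names binarized_jaccard and value_mismatch_jaccard are hypothetical and are not taken from umap, pynndescent, or scipy. The first compares only the non-zero patterns, which matches the reasoning above. The second counts positions where the values differ among positions where either vector is non-zero; that it explains the 0.333 is an assumption, since nothing in this thread confirms what scipy actually computes.

```python
import numpy as np


def binarized_jaccard(x, y):
    """Jaccard distance on the non-zero patterns of x and y (values ignored)."""
    x_nz = np.asarray(x) != 0
    y_nz = np.asarray(y) != 0
    union = np.count_nonzero(x_nz | y_nz)
    if union == 0:
        # Both vectors are all zeros: treat them as identical.
        return 0.0
    intersection = np.count_nonzero(x_nz & y_nz)
    return (union - intersection) / union


def value_mismatch_jaccard(x, y):
    """Assumed alternative: fraction of 'active' positions whose values differ.

    A position is active if either vector is non-zero there. This is a guess
    at a definition that distinguishes 5 from 1.45, not a confirmed formula.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    active = (x != 0) | (y != 0)
    n_active = np.count_nonzero(active)
    if n_active == 0:
        return 0.0
    mismatched = np.count_nonzero((x != y) & active)
    return mismatched / n_active


a = np.array([1, 5, 0, 1])
b = np.array([1, 1.45, 0, 1])

print(binarized_jaccard(a, b))       # same non-zero pattern, so 0.0
print(value_mismatch_jaccard(a, b))  # 1 differing value out of 3 active positions
```

Under the first definition, binarizing the data beforehand with astype(bool) cannot change the result, because only the non-zero pattern is ever consulted.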