Open Rridley7 opened 1 year ago
When running UMAP, it will for the most part be either the one you cite first or this one from pynndescent. For small datasets (the cutoff is a somewhat arbitrary 4096 samples) you may get the scipy version instead, since in those cases UMAP just uses sklearn's pairwise_distances to compute the full distance matrix.
I'm not sure what is going on with the scipy version for the data you cite. You can provide a weight vector to do weighted jaccard, but that's a third argument. So, to my mind, I can't see how you can get anything but a jaccard distance of 0, since the two vectors, despite having different values, share exactly the same non-zeros.
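To make the point about shared non-zeros concrete, here is a minimal pure-Python sketch of a Jaccard distance that only looks at which entries are non-zero. The name `jaccard_nonzero` and the example vectors are made up for illustration; this is not the code from umap, pynndescent, or scipy, just the definition as I understand it.

```python
def jaccard_nonzero(x, y):
    """Jaccard distance on the non-zero patterns of x and y (illustrative helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    union = 0         # positions where at least one vector is non-zero
    intersection = 0  # positions where both vectors are non-zero
    for a, b in zip(x, y):
        a_set = a != 0
        b_set = b != 0
        union += a_set or b_set
        intersection += a_set and b_set
    if union == 0:
        # Two all-zero vectors: treat them as identical.
        return 0.0
    return (union - intersection) / union


# Hypothetical data: u and v differ in value but share the same non-zero positions.
u = [0.0, 2.5, 0.0, 1.0, 7.0]
v = [0.0, 0.3, 0.0, 4.0, 9.0]
# w has a different non-zero pattern from u.
w = [1.0, 0.3, 0.0, 0.0, 9.0]

d_same = jaccard_nonzero(u, v)  # same non-zeros, different values
d_diff = jaccard_nonzero(u, w)  # different non-zeros

# Binarizing first (pure-Python analogue of astype(bool).astype(int))
# leaves the non-zero pattern, and so the distance, unchanged.
u_bin = [int(bool(a)) for a in u]
v_bin = [int(bool(b)) for b in v]
d_bin = jaccard_nonzero(u_bin, v_bin)

print(d_same, d_diff, d_bin)
```

Under this definition the actual values never enter the computation, so binarizing the input beforehand should not change the result.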
I had a quick clarification question about the jaccard distance in this package vs. the scipy spatial version, when considering non-binary data. The version in this package: https://github.com/lmcinnes/umap/blob/5c79fa60ce536405339da227bfd885635b68735d/umap/distances.py#L382
The version in scipy
When running UMAP, which of these versions is referenced when calculating distances via jaccard?
As an aside, I noticed that inputting a large matrix (~8M x 300) of non-binary data will run much faster than if it is first converted to binary values, such as with `array.astype(bool).astype(int)`. This is what led me to check for this difference between the two functions.