scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

New UMAP distances like ll_dirichlet and hellinger throw exception with small AnnDatas #1011

Closed: gokceneraslan closed this issue 1 month ago

gokceneraslan commented 4 years ago

If adata.n_obs < 4096, the umap version is >= 0.4, and the metric is a distance that scikit-learn does not support (like ll_dirichlet or hellinger), sc.pp.neighbors raises a ValueError.

Code for reproducing with UMAP >= 0.4:

import scanpy as sc

adata = sc.datasets.paul15()
sc.pp.neighbors(adata, metric='hellinger')

ValueError                                Traceback (most recent call last)
<ipython-input-5-e2c66b650fd3> in <module>
      2 
      3 adata = sc.datasets.paul15()
----> 4 sc.pp.neighbors(adata, metric='hellinger')

~/.anaconda3/lib/python3.7/site-packages/scanpy/neighbors/__init__.py in neighbors(adata, n_neighbors, n_pcs, use_rep, knn, random_state, method, metric, metric_kwds, copy)
    108         n_neighbors=n_neighbors, knn=knn, n_pcs=n_pcs, use_rep=use_rep,
    109         method=method, metric=metric, metric_kwds=metric_kwds,
--> 110         random_state=random_state,
    111     )
    112     adata.uns['neighbors'] = {}

~/.anaconda3/lib/python3.7/site-packages/scanpy/neighbors/__init__.py in compute_neighbors(self, n_neighbors, knn, n_pcs, use_rep, method, random_state, write_knn_indices, metric, metric_kwds)
    686             # non-euclidean case and approx nearest neighbors
    687             if X.shape[0] < 4096:
--> 688                 X = pairwise_distances(X, metric=metric, **metric_kwds)
    689                 metric = 'precomputed'
    690             knn_indices, knn_distances, forest = compute_neighbors_umap(

~/.anaconda3/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1550         raise ValueError("Unknown metric %s. "
   1551                          "Valid metrics are %s, or 'precomputed', or a "
-> 1552                          "callable" % (metric, _VALID_METRICS))
   1553 
   1554     if metric == "precomputed":

ValueError: Unknown metric hellinger. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski', 'haversine'], or 'precomputed', or a callable

Similar to this code in UMAP (https://github.com/lmcinnes/umap/pull/259/files), we should check whether scikit-learn's pairwise_distances throws a ValueError and, if so, fall back to UMAP's own pairwise_special_metric function.
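
To make the proposal concrete, here is a minimal, dependency-free sketch of the try-then-fall-back pattern. It is not scanpy's or UMAP's actual code: builtin_pairwise and special_pairwise are hypothetical stand-ins for scikit-learn's pairwise_distances and UMAP's pairwise_special_metric, and the hellinger function is a plain-Python version of the textbook definition, written only for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two non-negative vectors.

    Both vectors are normalised by their sums, so they need not sum to 1.
    """
    sp, sq = sum(p), sum(q)
    if sp == 0 and sq == 0:
        return 0.0
    if sp == 0 or sq == 0:
        return 1.0
    # Bhattacharyya coefficient of the two normalised vectors.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q)) / math.sqrt(sp * sq)
    # Clamp to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


# Hypothetical metric registries (assumptions for this sketch only).
BUILTIN_METRICS = {
    "manhattan": lambda p, q: sum(abs(a - b) for a, b in zip(p, q)),
}
SPECIAL_METRICS = {"hellinger": hellinger}


def _pairwise(X, func):
    # Dense all-pairs distance matrix as a list of lists.
    return [[func(a, b) for b in X] for a in X]


def builtin_pairwise(X, metric):
    # Stand-in for the general routine: rejects metrics it does not know.
    if metric not in BUILTIN_METRICS:
        raise ValueError(f"Unknown metric {metric}.")
    return _pairwise(X, BUILTIN_METRICS[metric])


def special_pairwise(X, metric):
    # Stand-in for the fallback routine that knows the extra metrics.
    return _pairwise(X, SPECIAL_METRICS[metric])


def pairwise_with_fallback(X, metric):
    """Try the general routine first; fall back only on ValueError."""
    try:
        return builtin_pairwise(X, metric)
    except ValueError:
        if metric not in SPECIAL_METRICS:
            raise  # Neither routine knows the metric: keep the original error.
        return special_pairwise(X, metric)
```

Re-raising when neither routine knows the metric keeps the original error message visible for genuine typos, while metrics known only to the fallback no longer fail on small inputs.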

ivirshup commented 8 months ago

@flying-sheep, was this fixed by the recent update?

flying-sheep commented 1 month ago

It sure was!

One just has to specify transformer="pynndescent" to make it happen.

flying-sheep commented 1 month ago

The only thing missing from the test in #1413 is ll_dirichlet, which seems to be a metric that’s implemented in umap but not PyNNDescent.