theislab / ehrapy

Electronic Health Record Analysis with Python.
https://ehrapy.readthedocs.io/
Apache License 2.0

Add approximate KNN backend #759

Closed eroell closed 3 days ago

eroell commented 5 months ago

Description of feature

scanpy offers fast approximate KNN backends via the transformer argument of pp.neighbors, which ehrapy currently blocks.

Adding this could remove a significant bottleneck for large datasets.

eroell commented 3 months ago

So in a bit more detail:

scanpy supports alternative kNN backends; see here for a tutorial.

This makes it possible to compute kNN matrices with the default kNN implementation:

import scanpy as sc

adata = sc.datasets.blobs(n_variables=1000, n_centers=4, n_observations=10000)
sc.pp.neighbors(adata)

or with faster backends:

import scanpy as sc
from sklearn_ann.kneighbors.annoy import AnnoyTransformer

adata = sc.datasets.blobs(n_variables=1000, n_centers=4, n_observations=10000)
sc.pp.neighbors(adata, transformer=AnnoyTransformer(5))
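For context, the objects passed as transformer follow the interface of scikit-learn's KNeighborsTransformer: they are fitted on the data and return a sparse neighbor-distance graph. The snippet below is only a sketch of that interface using the exact scikit-learn implementation on random data; the array sizes and variable names are illustrative assumptions, and an approximate backend such as AnnoyTransformer would be used the same way.

```python
import numpy as np
from sklearn.neighbors import KNeighborsTransformer

# Illustrative random data: 200 observations, 10 features (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Exact kNN transformer from scikit-learn; approximate backends expose
# the same fit / transform methods.
knn = KNeighborsTransformer(n_neighbors=5, mode="distance")

# Sparse matrix of shape (n_observations, n_observations). In "distance" mode
# scikit-learn stores one extra neighbor per row, because each sample is
# counted as its own neighbor.
graph = knn.fit_transform(X)
```

Any object with this fit/transform behaviour is what the comments here call an sklearn-like Transformer.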

In ehrapy, the transformer argument is not yet implemented.

While the default kNN implementation is available

import ehrapy as ep
import scanpy as sc

adata = sc.datasets.blobs(n_variables=1000, n_centers=4, n_observations=10000)
ep.pp.neighbors(adata)

using an sklearn-like Transformer is not supported; this option could give a substantial speedup for users with large datasets.

# this fails!
import ehrapy as ep
import scanpy as sc
from sklearn_ann.kneighbors.annoy import AnnoyTransformer

adata = sc.datasets.blobs(n_variables=1000, n_centers=4, n_observations=10000)
ep.pp.neighbors(adata, transformer=AnnoyTransformer(5)) # FAILS
TypeError: neighbors() got an unexpected keyword argument 'transformer'
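The TypeError above is what Python raises when a wrapper function has a fixed signature that does not include the keyword. One possible direction, sketched here with stand-in functions only, is to expose transformer on the wrapper and pass it through unchanged. Every name in this sketch (backend_neighbors, neighbors_blocking, neighbors_forwarding, FakeAnnTransformer) is hypothetical and is not ehrapy's or scanpy's actual code.

```python
from typing import Any, Optional


def backend_neighbors(data: list, n_neighbors: int = 15, transformer: Optional[Any] = None) -> dict:
    """Hypothetical stand-in for the underlying kNN routine."""
    backend = "default" if transformer is None else type(transformer).__name__
    return {"n_neighbors": n_neighbors, "backend": backend}


def neighbors_blocking(data: list, n_neighbors: int = 15) -> dict:
    """Wrapper with a fixed signature: passing transformer= raises TypeError."""
    return backend_neighbors(data, n_neighbors=n_neighbors)


def neighbors_forwarding(data: list, n_neighbors: int = 15, transformer: Optional[Any] = None) -> dict:
    """Wrapper that accepts transformer and forwards it to the backend."""
    return backend_neighbors(data, n_neighbors=n_neighbors, transformer=transformer)


class FakeAnnTransformer:
    """Placeholder for an sklearn-like approximate kNN transformer."""


# The blocking wrapper rejects the keyword, mirroring the error in this issue.
try:
    neighbors_blocking([], transformer=FakeAnnTransformer())
    blocked = False
except TypeError:
    blocked = True

# The forwarding wrapper hands the transformer to the backend.
result = neighbors_forwarding([], transformer=FakeAnnTransformer())
```

Keeping None as the default leaves existing calls without a transformer unchanged.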