theislab / ehrapy

Electronic Health Record Analysis with Python.
https://ehrapy.readthedocs.io/
Apache License 2.0

Enhancement/normalization dask #763

Closed eroell closed 4 months ago

eroell commented 4 months ago

PR Checklist

Description of changes

Allow the normalization methods to work with (dense) dask arrays. Suggest installing ehrapy[dask] for dependency management. dask might become a required dependency in the future.
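To illustrate why a normalization step can accept a dask array without forcing computation, here is a minimal sketch. It is not ehrapy's implementation: `standardize` is a hypothetical helper, and it assumes a scale-type normalization amounts to column-wise standardization built only from operations that NumPy and dask arrays share. The dask part runs only if dask happens to be installed.

```python
import importlib.util

import numpy as np


def standardize(x):
    """Column-wise z-score using only methods shared by NumPy and dask arrays.

    Hypothetical helper for illustration; not part of ehrapy.
    """
    return (x - x.mean(axis=0)) / x.std(axis=0)


rng = np.random.default_rng(42)
dense = rng.normal(loc=5.0, scale=2.0, size=(1_000, 8))

# In-memory input: the result is computed immediately.
scaled_dense = standardize(dense)

# Optional: repeat with a chunked dask array when dask is available.
dask_available = importlib.util.find_spec("dask") is not None
if dask_available:
    import dask.array as da

    lazy = standardize(da.from_array(dense, chunks=(250, 8)))
    # Still a dask array: only a task graph has been built so far.
    assert isinstance(lazy, da.Array)
    # Values are produced only by an explicit .compute() call.
    assert np.allclose(lazy.compute(), scaled_dense)
```

Because the function never converts its input to a NumPy array, the same code path serves both cases, and the cost of the dask version is deferred to whichever later step calls `.compute()`.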

Technical details

Additional context

Example, profiled with scalene (run the Python scripts below as scalene <scriptname>.py), demonstrating that ep.pp.scale_norm does not trigger the computation and is not a performance bottleneck:

In memory (numpy) array

from scalene import scalene_profiler

# Pause profiling so that imports and data setup are not measured.
scalene_profiler.stop()

import anndata as ad
import ehrapy as ep
import pandas as pd
import scanpy as sc
from sklearn.datasets import make_blobs

n_individuals = 50000
n_features = 1000
n_groups = 4

# Synthetic data: 50000 individuals x 1000 numeric features in 4 clusters.
data_features, data_labels = make_blobs(n_samples=n_individuals, n_features=n_features, centers=n_groups, random_state=42)
var = pd.DataFrame({"feature_type": ["numeric"] * n_features})
adata = ad.AnnData(X=data_features, obs={"label": data_labels}, var=var)

# Profile only the analysis pipeline.
scalene_profiler.start()
ep.pp.scale_norm(adata)
ep.pp.pca(adata)
sc.pp.neighbors(adata)
ep.tl.leiden(adata)
ep.pl.pca(adata, color="leiden", save="profiling_memory_pca.png")
scalene_profiler.stop()

[Image: memory_profile_50000x1000]

Out-of-memory (dask) array

from scalene import scalene_profiler

# Pause profiling so that imports and data setup are not measured.
scalene_profiler.stop()

import anndata as ad
import dask.array as da
import ehrapy as ep
import pandas as pd
import scanpy as sc
from sklearn.datasets import make_blobs

n_individuals = 50000
n_features = 1000
n_groups = 4
chunks = 1000

# Synthetic data: 50000 individuals x 1000 numeric features in 4 clusters.
data_features, data_labels = make_blobs(n_samples=n_individuals, n_features=n_features, centers=n_groups, random_state=42)
# Wrap the data in a dask array split into blocks of 1000 x 1000.
data_features = da.from_array(data_features, chunks=chunks)
var = pd.DataFrame({"feature_type": ["numeric"] * n_features})
adata = ad.AnnData(X=data_features, obs={"label": data_labels}, var=var)

# Profile only the analysis pipeline.
scalene_profiler.start()
ep.pp.scale_norm(adata)
ep.pp.pca(adata)
# Materialize the PCA embedding in memory before the downstream steps.
adata.obsm["X_pca"] = adata.obsm["X_pca"].compute()
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
sc.pl.pca(adata, color="leiden", save="profiling_out_of_core_pca.png")
scalene_profiler.stop()

[Image: out_of_core_profile_50000x1000]
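The out-of-core script calls .compute() on the PCA embedding before handing it to the downstream steps. A small sketch of that hand-off pattern follows. `materialize` and `FakeLazyArray` are hypothetical names used only for illustration; the stand-in class mimics the single `.compute()` method the pattern relies on, so the snippet runs without dask installed.

```python
import numpy as np


def materialize(array_like):
    """Return an in-memory NumPy array, computing lazy inputs first.

    Hypothetical helper: anything exposing a callable .compute()
    (as dask arrays do) is computed, everything else is passed to
    np.asarray unchanged.
    """
    compute = getattr(array_like, "compute", None)
    if callable(compute):
        array_like = compute()
    return np.asarray(array_like)


class FakeLazyArray:
    """Stand-in for a lazy array; records whether compute() was called."""

    def __init__(self, values):
        self._values = values
        self.computed = False

    def compute(self):
        self.computed = True
        return np.asarray(self._values)


lazy_embedding = FakeLazyArray([[0.1, 0.2], [0.3, 0.4]])
embedding = materialize(lazy_embedding)        # triggers compute()
passthrough = materialize(np.ones((2, 2)))     # already in memory
```

Keeping the materialization in one explicit place makes it easy to see which step pays the computation cost when reading a profile.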

eroell commented 4 months ago

Weird: the pull_request_target check still seems to appear, even though we changed run_notebooks.yml to use the pull_request trigger. But I can see that the notebook runs triggered on pull_request, which are the ones that should work, actually pass.

It might disappear once this is merged; let's see how this behaves in future PRs. Big thanks @flying-sheep @ilan-gold.