scverse / pertpy

Perturbation Analysis in the scverse ecosystem.
https://pertpy.readthedocs.io/en/latest/
MIT License
91 stars 19 forks source link

Bootstrap #502

Closed eroell closed 1 month ago

eroell commented 5 months ago

PR Checklist

Description of changes Addresses issue #497. This PR introduces convenience functionality for using (random sampling with replacement) bootstrapping towards error estimation of distances between perturbations.

For the user, no breaking changes are introduced. The changes visible are

Technical details

Additional context: Examples

__call__ vs bootstrap

import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap

adata = pt.dt.distance_example()

dist = pt.tl.Distance("euclidean")

print(dist(X, Y))
print(dist.bootstrap(X, Y, n_bootstrap=100))
2.640588
MeanVar(mean=3.3500843, variance=0.06426364)

onesided

import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap

adata = pt.dt.distance_example()

dist = pt.tl.Distance("euclidean")

dists_one = dist.onesided_distances(adata, groupby="perturbation", selected_group="control")
print(dists_one.head())

dists_one, vars_one = dist.onesided_distances(
    adata, groupby="perturbation", selected_group="control", bootstrap=True
)
print(dists_one.head())
print(vars_one.head())
perturbation
p-INTERGENIC393453     1.823859
p-sgELF1-2             1.992385
p-INTERGENIC216151     1.856206
p-INTERGENIC1144056    1.999334
p-sgELF1-5             2.132977

perturbation
p-INTERGENIC393453     2.194509
p-sgELF1-2             2.335306
p-INTERGENIC216151     2.192096
p-INTERGENIC1144056    2.310208
p-sgELF1-5             2.421595

perturbation
p-INTERGENIC393453     0.133328
p-sgELF1-2             0.150482
p-INTERGENIC216151     0.136785
p-INTERGENIC1144056    0.124505
p-sgELF1-5             0.158694

pairwise

import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap

adata = pt.dt.distance_example()

dist = pt.tl.Distance("euclidean")

dists = dist.pairwise(adata, groupby="perturbation")

means, vars = dist.pairwise(adata, groupby="perturbation", bootstrap=True)

clustermap(dists, robust=True, figsize=(10, 10))
clustermap(means, robust=True, figsize=(10, 10))
clustermap(vars, robust=True, figsize=(10, 10))

image

image

image

eroell commented 5 months ago

There are some tests failing, solving this together with maybe some more overhauling - opened draft PR with examples to raise some suggestions, especially with examples, of what our ideas could materialize to.

Slightly bugged that the bootstrapping mean is typically larger than the single-computed distance, not 100% sure about reason and validity right now

Zethson commented 5 months ago

Yeah, smaller test datasets are also fine. We have issues with some distances failing on smaller test datasets so we had to sadly keep them a bit on the larger side.

I'm fixing the scgen test now. The 3.11 pre-release fails due to something numba related.

eroell commented 5 months ago

Yeah, smaller test datasets are also fine. We have issues with some distances failing on smaller test datasets so we had to sadly keep them a bit on the larger side.

I'm fixing the scgen test now. The 3.11 pre-release fails due to something numba related.

(3 Failures caused by these changes right now I believe, to be fixed.)