scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Error in Scanpy Doublet Analysis with Scrublet on .h5ad Data Using backed="r+" #3370

Closed simang5c closed 3 hours ago

simang5c commented 1 week ago

What happened?

I'm encountering an issue while running Scrublet for doublet detection on an .h5ad file loaded with backed="r+" in Scanpy. `sc.pp.scrublet` raises a ValueError when it calls `adata.copy()`, which an AnnData object in backed mode does not allow without a filename.

Has anyone faced this issue before? If so, do you know of any workarounds or alternative approaches to run Scrublet on such data without having to fully load it into memory? Any suggestions would be greatly appreciated!

Minimal code sample

import scanpy as sc

# Path to the file
output_file = '/home/test_folder/project1_matrix.h5ad'

# Read the .h5ad file, which contains around 1 million cells.
# backed="r+" keeps X on disk instead of loading it into memory.
adata = sc.read_h5ad(output_file, backed="r+")

# Raises ValueError: scrublet calls adata.copy(), which backed mode rejects
sc.pp.scrublet(adata)

Error output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/test_env/lib/python3.12/site-packages/legacy_api_wrap/__init__.py", line 80, in fn_compatible
    return fn(*args_all, **kw)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/test_env/lib/python3.12/site-packages/scanpy/preprocessing/_scrublet/__init__.py", line 180, in scrublet
    adata = adata.copy()
            ^^^^^^^^^^^^
  File "/home/test_env/python3.12/site-packages/anndata/_core/anndata.py", line 1447, in copy
    raise ValueError(
ValueError: To copy an AnnData object in backed mode, pass a filename: `.copy(filename='myfilename.h5ad')`. To load the object into memory, use `.to_memory()`.

Versions

```
sc.logging.print_versions()
-----
anndata             0.11.1
scanpy              1.10.4
-----
PIL                 11.0.0
absl                NA
attr                24.2.0
cffi                1.17.1
chex                0.1.87
cycler              0.12.1
cython_runtime      NA
dateutil            2.9.0.post0
distutils           3.12.6
docrep              0.3.2
doubletdetection    4.2
etils               1.10.0
filelock            3.16.1
flax                0.10.1
fsspec              2024.10.0
h5py                3.12.1
igraph              0.11.8
jaraco              NA
jax                 0.4.35
jaxlib              0.4.35
joblib              1.4.2
kiwisolver          1.4.7
lazy_loader         0.4
legacy_api_wrap     NA
leidenalg           0.10.2
lightning           2.4.0
lightning_utilities 0.11.8
llvmlite            0.43.0
louvain             0.8.2
matplotlib          3.9.2
ml_collections      1.0.0
ml_dtypes           0.5.0
more_itertools      10.5.0
mpl_toolkits        NA
mpmath              1.3.0
msgpack             1.1.0
mudata              0.3.1
multipledispatch    0.6.0
natsort             8.4.0
numba               0.60.0
numexpr             2.10.1
numpy               1.26.4
numpyro             0.15.3
nvidia              NA
opt_einsum          3.4.0
optax               0.2.4
packaging           24.2
pandas              2.2.3
phenograph          1.5.7
pkg_resources       NA
platformdirs        4.3.6
psutil              6.1.0
pycparser           2.22
pygments            2.18.0
pynndescent         0.5.13
pyparsing           3.2.0
pyro                1.9.1
pytz                2024.2
rich                NA
scipy               1.14.1
scvi                1.2.0
session_info        1.0.0
setuptools          74.1.2
six                 1.16.0
skimage             0.24.0
sklearn             1.5.2
sparse              0.15.4
sympy               1.13.1
tables              3.10.1
texttable           1.7.0
threadpoolctl       3.5.0
toolz               1.0.0
torch               2.5.1+cu124
torchgen            NA
torchmetrics        1.6.0
tqdm                4.67.0
triton              3.1.0
typing_extensions   NA
wcwidth             0.2.13
xarray              2024.10.0
yaml                6.0.2
zstandard           0.23.0
-----
Python 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:16:49) [GCC 13.3.0]
Linux-6.8.0-48-generic-x86_64-with-glibc2.39
-----
Session information updated at 2024-11-15 11:48
```

ilan-gold commented 3 hours ago

Hello! We do not support backed mode for scrublet. However, if you wish to contribute this, we would be more than happy. Alternatively, and probably more sustainably, dask support could be added (you can begin to use dask via https://anndata.readthedocs.io/en/stable/generated/anndata.experimental.read_elem_as_dask.html). I'm going to close this because we already have an issue for it: https://github.com/scverse/scanpy/issues/2578
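
Until backed or dask support exists, one possible interim workaround is to load one sample at a time with `.to_memory()` (the method the error message points to) and run Scrublet on each piece. The sketch below is not an officially supported path. It assumes a hypothetical `obs` column named `sample` that identifies each sample, that every sample fits in memory on its own, and that cell names are unique. The helper name `scrublet_per_sample` and its `detect` parameter are made up for illustration.

```python
import pandas as pd


def scrublet_per_sample(adata, sample_key="sample", detect=None):
    """Run doublet detection one sample at a time on a backed AnnData.

    `sample_key` is a hypothetical obs column naming each sample.
    `detect` is any callable that adds `doublet_score` and
    `predicted_doublet` columns to `.obs` in place; it defaults to
    `scanpy.pp.scrublet`.
    """
    if detect is None:
        import scanpy as sc  # imported lazily so the helper stays importable

        detect = sc.pp.scrublet

    columns = ["doublet_score", "predicted_doublet"]
    samples = adata.obs[sample_key]  # obs is held in memory even in backed mode
    results = []
    for sample in samples.unique():
        mask = (samples == sample).to_numpy()
        # Only this sample's cells are read from disk into memory.
        chunk = adata[mask].to_memory()
        detect(chunk)
        results.append(chunk.obs[columns])

    # Reassemble the per-sample results in the original cell order.
    combined = pd.concat(results).reindex(adata.obs_names)
    for col in columns:
        adata.obs[col] = combined[col]
    return combined
```

With the file from the report, this would be called as `scrublet_per_sample(sc.read_h5ad(output_file, backed="r+"), sample_key="sample")`. The scores end up in the in-memory `adata.obs`, and peak memory is bounded by the largest sample rather than by the whole matrix.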