Errors caused by X when subsetting AnnData object

jonas2612 commented 10 months ago

Please make sure these conditions are met

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of anndata.
[ ] (optional) I have confirmed this bug exists on the master branch of anndata.

Report

I haven't been able to reproduce the error(s) on a smaller example. The dataset can be downloaded and assembled from here https://shendure-web.gs.washington.edu/content/members/cxqiu/public/nobackup/jax/download/adata/. X is in csr und experimental_batch is saved as categories. I receive two different error messages dependent on subsetting by equality or inequality

Code:

import scanpy as sc

adata = sc.read('/lustre/groups/ml01/workspace/monge_velo/data/mouse_gastrulation_atlas.h5ad')
adata_tmp1 = adata[adata.obs["experimental_batch"]!='run_22']
adata_tmp1.X

Traceback:

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/anndata/_core/anndata.py:612, in AnnData.X(self)
    609     X = None
    610 elif self.is_view:
    611     X = as_view(
--> 612         _subset(self._adata_ref.X, (self._oidx, self._vidx)),
    613         ElementRef(self, "X"),
    614     )
    615 else:
    616     X = self._X

File ~/mambaforge/envs/atlas_pca/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/anndata/_core/index.py:140, in _subset_spmatrix(a, subset_idx)
    138 if len(subset_idx) > 1 and all(isinstance(x, cabc.Iterable) for x in subset_idx):
    139     subset_idx = (subset_idx[0].reshape(-1, 1), *subset_idx[1:])
--> 140 return a[subset_idx]

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/scipy/sparse/_index.py:75, in IndexMixin.__getitem__(self, key)
     73         return self._get_arrayXint(row, col)
     74     elif isinstance(col, slice):
---> 75         return self._get_arrayXslice(row, col)
     76 else:  # row.ndim == 2
     77     if isinstance(col, INT_TYPES):

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/scipy/sparse/_csr.py:335, in _csr_base._get_arrayXslice(self, row, col)
    333     col = np.arange(*col.indices(self.shape[1]))
    334     return self._get_arrayXarray(row, col)
--> 335 return self._major_index_fancy(row)._get_submatrix(minor=col)

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/scipy/sparse/_compressed.py:705, in _cs_matrix._major_index_fancy(self, idx)
    702 np.cumsum(row_nnz, out=res_indptr[1:])
    704 nnz = res_indptr[-1]
--> 705 res_indices = np.empty(nnz, dtype=idx_dtype)
    706 res_data = np.empty(nnz, dtype=self.dtype)
    707 csr_row_index(M, indices, self.indptr, self.indices, self.data,
    708               res_indices, res_data)

ValueError: negative dimensions are not allowed

Code:

import scanpy as sc

adata = sc.read('/lustre/groups/ml01/workspace/monge_velo/data/mouse_gastrulation_atlas.h5ad')
adata_tmp2 = adata[adata.obs["experimental_batch"]=='run_22']
adata_tmp2.X

Traceback:

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/anndata/_core/anndata.py:612, in AnnData.X(self)
    609     X = None
    610 elif self.is_view:
    611     X = as_view(
--> 612         _subset(self._adata_ref.X, (self._oidx, self._vidx)),
    613         ElementRef(self, "X"),
    614     )
    615 else:
    616     X = self._X

File ~/mambaforge/envs/atlas_pca/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/anndata/_core/index.py:140, in _subset_spmatrix(a, subset_idx)
    138 if len(subset_idx) > 1 and all(isinstance(x, cabc.Iterable) for x in subset_idx):
    139     subset_idx = (subset_idx[0].reshape(-1, 1), *subset_idx[1:])
--> 140 return a[subset_idx]

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/scipy/sparse/_index.py:75, in IndexMixin.__getitem__(self, key)
     73         return self._get_arrayXint(row, col)
     74     elif isinstance(col, slice):
---> 75         return self._get_arrayXslice(row, col)
     76 else:  # row.ndim == 2
     77     if isinstance(col, INT_TYPES):

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/scipy/sparse/_csr.py:335, in _csr_base._get_arrayXslice(self, row, col)
    333     col = np.arange(*col.indices(self.shape[1]))
    334     return self._get_arrayXarray(row, col)
--> 335 return self._major_index_fancy(row)._get_submatrix(minor=col)

File ~/mambaforge/envs/atlas_pca/lib/python3.10/site-packages/scipy/sparse/_compressed.py:707, in _cs_matrix._major_index_fancy(self, idx)
    705 res_indices = np.empty(nnz, dtype=idx_dtype)
    706 res_data = np.empty(nnz, dtype=self.dtype)
--> 707 csr_row_index(M, indices, self.indptr, self.indices, self.data,
    708               res_indices, res_data)
    710 return self.__class__((res_data, res_indices, res_indptr),
    711                       shape=new_shape, copy=False)

ValueError: Output dtype not compatible with inputs.

Versions

``` ----- anndata 0.9.2 numpy 1.24.4 pandas 2.0.3 scanpy 1.9.3 scipy 1.11.2 session_info 1.0.0 ----- PIL 10.0.0 asciitree NA asttokens NA backcall 0.2.0 cloudpickle 2.2.1 comm 0.1.2 cycler 0.10.0 cython_runtime NA cytoolz 0.12.2 dask 2023.8.1 dateutil 2.8.2 debugpy 1.5.1 decorator 5.1.1 entrypoints 0.4 executing 0.8.3 fasteners 0.18 h5py 3.9.0 importlib_metadata NA ipykernel 6.19.2 jedi 0.18.1 jinja2 3.1.2 joblib 1.3.2 kiwisolver 1.4.4 llvmlite 0.40.1 lz4 4.3.2 markupsafe 2.1.3 matplotlib 3.7.2 mpl_toolkits NA msgpack 1.0.5 natsort 8.4.0 numba 0.57.1 numcodecs 0.11.0 packaging 23.1 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 3.10.0 prompt_toolkit 3.0.36 psutil 5.9.5 ptyprocess 0.7.0 pure_eval 0.2.2 pyarrow 12.0.1 pydev_ipython NA pydevconsole NA pydevd 2.6.0 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.11.2 pyparsing 3.0.9 pytz 2023.3 six 1.16.0 sklearn 1.3.0 stack_data 0.2.0 tblib 1.7.0 threadpoolctl 3.2.0 tlz 0.12.2 toolz 0.12.0 tornado 6.3.3 traitlets 5.7.1 typing_extensions NA wcwidth 0.2.5 yaml 6.0 zarr 2.16.1 zipp NA zmq 23.2.0 zoneinfo NA ----- IPython 8.8.0 jupyter_client 7.4.8 jupyter_core 5.1.1 ----- Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.28 ```

ivirshup commented 9 months ago

Interesting issue! You've caught most of us traveling, so no chance to replicate yet. I suspect this may have to do with some recent internal changes in scipy I made... I'll be interested to see if pinning scipy lower helps fix this.

As there is no mouse_gastrulation_atlas.h5ad file in the folder you link, what does that correspond to? Is it the each of those files concatenated together?

jonas2612 commented 9 months ago

Yes, exactly. The file is the concatenation of all JAX-files

ivirshup commented 9 months ago

Does this happen with any one of the files, or do they have to be concatenated for this to occur?

jonas2612 commented 9 months ago

To merge the files I first concatenated two files each with anndata, but the memory requirements for concatenating the results are too large, so I used a function of my own to concatenate .X. I currently believe that there the issue originates from, as I can subset the intermediate files. The function I used was

from pathlib import Path

import h5py
from scipy import sparse

import anndata as ad
from anndata._core.sparse_dataset import SparseDataset
from anndata.experimental import read_elem, write_elem

def read_everything_but_X(pth) -> ad.AnnData:
    attrs = ["obs", "var", "obsm", "varm", "obsp", "varp", "uns"]
    with h5py.File(pth) as f:
        adata = ad.AnnData(**{k: read_elem(f[k]) for k in attrs})
    return adata

def concat_on_disk(input_pths: list[Path], output_pth: Path):
    """
    Params
    ------
    input_pths
        Paths to h5ad files which will be concatenated
    output_pth
        File to write as a result
    """
    annotations = ad.concat([read_everything_but_X(pth) for pth in input_pths])

    annotations.write_h5ad(output_pth)
    n_variables = annotations.shape[1]

    del annotations

    with h5py.File(output_pth, "a") as target:
        dummy_X = sparse.csr_matrix((0, n_variables), dtype="float64")
        dummy_X.indptr = dummy_X.indptr.astype("int64") # Guarding against overflow for very large datasets

        write_elem(target, "X", dummy_X)
        mtx = SparseDataset(target["X"])
        for p in pths:
            with h5py.File(p, "r") as src:
                mtx.append(SparseDataset(src["X"]))

Alas, I need .X, so the lazy concatenation from anndata isn't an option.

ivirshup commented 8 months ago

I've gotten a chance to look at this. But haven't had a chance to use the exact same file.

First, this is an 11 million cell dataset, so it has pretty large memory requirements to load and index the whole thing. I haven't managed to create a session with enough memory to that yet.
The exact file this was encountered with may no longer exist. @jonas2612 has recreated it with concat_on_disk however.

With the new file, I am able to do things like:

adata = ad.read_h5ad(ADATA_PTH, backed="r")
result = adata[adata.obs["experimental_batch"]!='run_22'].to_memory()

Or:

half = adata[:5_000_000].to_memory()
not_run_19 = half[half.obs["experimental_batch"] != "run_19"].copy()

So maybe this works with new file. Or maybe I'm not hitting the right conditions yet. This vaguely looks like a recent issue in scipy where index dtypes weren't large enough, which I'd like to know we're avoiding https://github.com/scipy/scipy/issues/19205.

@jonas2612 could you try and replicate the original issue with the new file before I try and dig in some more?

jonas2612 commented 8 months ago

Yes, I will try to recreate the file tomorrow. I'll let you know once it's build

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

scverse / anndata