scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Segmentation fault when copying subsets of large datasets #1353

Closed · mumichae closed this 9 months ago

mumichae commented 9 months ago


Report

When trying to subset a large AnnData object (>1M cells, all in memory), I consistently get a segmentation fault. Peak memory usage is about 30 GB according to /usr/bin/time.

Code:

for split_file in split_files:
    split = file_value_map.get(split_file)
    out_file = out_dir / f"value~{split_file}.zarr"

    # split anndata
    logging.info(f'Split by {split_key}={split}')
    adata_sub = adata[adata.obs[split_key] == split]
    logging.info(adata_sub.__str__())

    # write to file
    logging.info(f'Write to {out_file}...')
    adata_sub.write_zarr(out_file)
    del adata_sub
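
The traceback below is in the format printed by Python's built-in faulthandler module. For reference, a similar Python-level trace can be captured on a segfault by enabling it explicitly, e.g. python -X faulthandler script.py, or in code (a minimal sketch, not part of the original report):

import faulthandler

# Print the Python stack of every thread if the interpreter crashes hard
# (e.g. a segfault inside C/C++ extension code).
faulthandler.enable()

# ... run the splitting/writing code from above ...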

Traceback:

Fatal Python error: Segmentation fault   

Current thread 0x00007ff2d9fa3740 (most recent call first):
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/scipy/sparse/_compressed.py", line 707 in _major_index_fancy
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/scipy/sparse/_csr.py", line 335 in _get_arrayXslice
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/scipy/sparse/_index.py", line 75 in __getitem__
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_core/index.py", line 177 in _subset_spmatrix
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/functools.py", line 889 in wrapper
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_core/anndata.py", line 683 in X
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_io/specs/methods.py", line 250 in write_anndata
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_io/specs/registry.py", line 57 in wrapper
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_io/zarr.py", line 46 in callback
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_io/specs/registry.py", line 310 in write_elem
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_io/utils.py", line 243 in func_wrapper
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/experimental/_dispatch_io.py", line 99 in write_dispatched
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_io/zarr.py", line 48 in write_zarr
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_io/__init__.py", line 20 in write_zarr
  File "/hpc/projects/hca_integration/mambaforge/envs/scanpy/lib/python3.10/site-packages/anndata/_core/anndata.py", line 2080 in write_zarr
  File "/hpc/projects/hca_integration/workspace/immune-v1-scripts/.snakemake/scripts/tmp83idbt4z.split_anndata.py", line 109 in <module>

Versions

-----
anndata             0.10.4
dask                2024.1.1
numpy               1.26.3
scanpy              1.10.0.dev198+g89869212
scipy               1.11.4
session_info        1.0.0
snakemake           7.31.1
sparse              0.15.1
-----
PIL                 10.2.0
asciitree           NA
cffi                1.15.1
cloudpickle         3.0.0
colorama            0.4.6
cycler              0.12.1
cython_runtime      NA
cytoolz             0.12.2
dateutil            2.8.2
defusedxml          0.7.1
exceptiongroup      1.2.0
fasteners           0.17.3
h5py                3.10.0
igraph              0.11.3
importlib_metadata  NA
iniconfig           NA
jinja2              3.1.3
joblib              1.3.2
kiwisolver          1.4.5
legacy_api_wrap     NA
leidenalg           0.10.2
llvmlite            0.41.1
lz4                 4.3.3
markupsafe          2.1.4
matplotlib          3.8.2
mpl_toolkits        NA
msgpack             1.0.7
natsort             8.4.0
numba               0.58.1
numcodecs           0.12.1
packaging           23.2
pandas              2.2.0
pluggy              1.2.0
psutil              5.9.8
py                  NA
pyarrow             15.0.0
pyparsing           3.1.1
pytest              7.4.0
pytz                2023.4
setuptools_scm      NA
six                 1.16.0
sklearn             1.4.0
tblib               3.0.0
texttable           1.7.0
threadpoolctl       3.2.0
tlz                 0.12.2
toolz               0.12.1
typing_extensions   NA
wcwidth             0.2.13
yaml                6.0.1
zarr                2.16.1
zipp                NA
zoneinfo            NA
-----
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28
-----
flying-sheep commented 9 months ago

This is the call where it happens: https://github.com/scipy/scipy/blob/v1.11.4/scipy/sparse/_compressed.py#L707-L708

Which is this C++ function: https://github.com/scipy/scipy/blob/v1.11.4/scipy/sparse/sparsetools/csr.h#L1249-L1265

I think it would be good if you could figure out the index/slice that _subset_spmatrix was called with, as well as the dtypes of the sparse matrix's components (indptr, …)
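
Something along these lines could be used to collect that information (a hypothetical sketch, not from the thread; it assumes adata, split_key and split are the variables from the snippet in the issue, and that adata.X is an in-memory scipy CSR matrix):

import numpy as np

X = adata.X  # assumed to be a scipy.sparse.csr_matrix
print("indptr dtype: ", X.indptr.dtype)
print("indices dtype:", X.indices.dtype)
print("stored values:", len(X.data))

# the row index that the AnnData subset hands down to _subset_spmatrix
subset_rows = np.where((adata.obs[split_key] == split).to_numpy())[0]
print("subset_rows:", subset_rows, subset_rows.dtype)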

mumichae commented 9 months ago

Here's the output I get

a.indptr.dtype: int32
a.indices.dtype: int32
subset_idx: (array([     19,      31,      36, ..., 1058893, 1058897, 1058904]), slice(None, None, None))
subset_idx[0].dtype: int64

Funnily enough, the code works for the first subset of ~700k cells with the same dtypes, but fails with the current 70k subset.

mumichae commented 9 months ago

I also checked the dtype of the complete matrix indices and it is also int32. Could it be that the inconsistency in index dtypes is the issue here?
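
One way to probe that question is to check whether the int32 containers are still internally consistent, since an overflowed indptr is the typical symptom of the problem tracked in #1261 (a hypothetical check, not from the thread; adata.X is again assumed to be an in-memory CSR matrix):

import numpy as np

X = adata.X
int32_max = np.iinfo(np.int32).max

# indptr must be non-decreasing and its last entry must equal the number of
# stored values; a wrapped (overflowed) int32 indptr violates both.
indptr64 = X.indptr.astype(np.int64)  # cast so the diff itself cannot overflow
print("stored values:  ", len(X.data))
print("indptr[-1]:     ", indptr64[-1])
print("fits in int32:  ", len(X.data) <= int32_max)
print("indptr monotone:", bool(np.all(np.diff(indptr64) >= 0)))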

flying-sheep commented 9 months ago

Probably! @ivirshup had some recent fixes related to that, so he probably knows!

See https://github.com/scverse/anndata/issues/1261

mumichae commented 9 months ago

Update: changing the index dtypes to int64 still leads to the same error.
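
For what it's worth, if the int32 indptr already overflowed when the matrix was built, the wrapped values are simply wrong, and casting them to int64 afterwards only preserves the wrong values, which would be consistent with the cast not helping. A cast of that kind (hypothetical; not necessarily the exact code used here) would look like:

import numpy as np

X = adata.X
# Assign the attributes directly; going through the csr_matrix constructor
# may downcast the index arrays back to int32 when their values happen to fit.
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)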

ivirshup commented 9 months ago

@mumichae, if you do:

mask = (adata.obs[split_key] == split).to_numpy()
adata.X[mask]

Do you get the same segfault?

mumichae commented 9 months ago

I figured out the cause of the error: the input file was corrupted due to the issue described in #1261. I had first read a large AnnData with adata.X as a dask array of scipy.sparse.csr_matrix chunks, whose indptr arrays were int32, whereas the fully concatenated adata.X would have required int64.
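
A guard against this situation could check, before concatenating or materializing a dask array of CSR chunks, whether the total number of stored values still fits into int32 (a sketch under the assumption that adata.X is such a dask array; not from the thread):

import dask
import numpy as np

# Each block of the dask array is a scipy CSR chunk; collect their nnz lazily,
# then compute them all at once.
nnz_per_block = [blk.nnz for blk in adata.X.to_delayed().ravel()]
total_nnz = sum(dask.compute(*nnz_per_block))

needs_int64 = total_nnz > np.iinfo(np.int32).max
print(f"total nnz: {total_nnz}; int64 indptr required: {needs_int64}")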