This is the call where it happens: https://github.com/scipy/scipy/blob/v1.11.4/scipy/sparse/_compressed.py#L707-L708
Which is this C++ function: https://github.com/scipy/scipy/blob/v1.11.4/scipy/sparse/sparsetools/csr.h#L1249-L1265
I think it would be good if you could figure out the index/slice that `_subset_spmatrix`
was called with, as well as the dtypes of the sparse matrix's components (indptr, …)
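For example, something along these lines (a sketch with placeholder names, assuming `a` is the sparse matrix being subset and `subset_idx` is the index passed to `_subset_spmatrix`):

```python
# Sketch for gathering the requested debug info; `a` and `subset_idx` are
# placeholders for the matrix and the index inside _subset_spmatrix.
print("a.indptr.dtype:", a.indptr.dtype)
print("a.indices.dtype:", a.indices.dtype)
print("subset_idx:", subset_idx)
if hasattr(subset_idx[0], "dtype"):
    print("subset_idx[0].dtype:", subset_idx[0].dtype)
```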
Here's the output I get
```
a.indptr.dtype: int32
a.indices.dtype: int32
subset_idx: (array([ 19, 31, 36, ..., 1058893, 1058897, 1058904]), slice(None, None, None))
subset_idx[0].dtype: int64
```
Funnily enough, the code works for the first subset of ~700k cells with the same dtypes, but fails with the current 70k subset.
I also checked the dtype of the complete matrix indices and it is also int32. Could it be that the inconsistency in index dtypes is the issue here?
Probably! @ivirshup had some recent fixes related to that, so he probably knows!
Update: changing index dtypes to int64 still leads to the same error
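By changing the index dtypes I mean roughly the following (a minimal sketch with hypothetical variable names, assuming `a` is the full `csr_matrix`, not necessarily the exact code I ran):

```python
import numpy as np

# Hypothetical sketch: upcast the CSR index arrays to int64 before subsetting.
a.indptr = a.indptr.astype(np.int64)
a.indices = a.indices.astype(np.int64)
```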
@mumichae, if you do:
```python
mask = (adata.obs[split_key] == split).to_numpy()
adata.X[mask]
```
Do you get the same segfault?
I figured out the cause of the error: the input file was corrupted due to the issue described in #1261. I had first read a large AnnData object with adata.X as a dask array of scipy.sparse.csr_matrix chunks, whose indptr arrays were int32, whereas the full concatenated adata.X would have required int64.
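To illustrate the failure mode (a hypothetical check, not code from my actual pipeline): once the concatenated matrix stores more values than int32 can address, its indptr must be int64, otherwise the offsets wrap around and later indexing reads out of bounds.

```python
import numpy as np
import scipy.sparse as sp

def needs_int64_indptr(chunks: list[sp.csr_matrix]) -> bool:
    # If the total number of stored values across all chunks exceeds the
    # int32 range, an int32 indptr on the concatenated matrix would overflow.
    total_nnz = sum(chunk.nnz for chunk in chunks)
    return total_nnz > np.iinfo(np.int32).max
```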
Report
When trying to subset a large AnnData object (>1M cells, all in memory), I consistently get a segmentation fault. The peak memory usage is about 30 GB according to /usr/bin/time.
Code:
Traceback:
Versions