jonas2612 opened 10 months ago
Interesting issue! You've caught most of us traveling, so no chance to replicate yet. I suspect this may have to do with some recent internal changes in scipy I made... I'll be interested to see if pinning scipy lower helps fix this.
As there is no `mouse_gastrulation_atlas.h5ad` file in the folder you link, what does that correspond to? Is it each of those files concatenated together?
Yes, exactly. The file is the concatenation of all JAX files.
Does this happen with any one of the files, or do they have to be concatenated for this to occur?
To merge the files I first concatenated the files pairwise with `anndata`, but the memory requirements for concatenating the results are too large, so I used a function of my own to concatenate `.X`. I currently believe that this is where the issue originates, as I can still subset the intermediate files. The function I used was:
```python
from pathlib import Path

import h5py
from scipy import sparse
import anndata as ad
from anndata._core.sparse_dataset import SparseDataset
from anndata.experimental import read_elem, write_elem


def read_everything_but_X(pth) -> ad.AnnData:
    # Read all AnnData components except the (large) X matrix
    attrs = ["obs", "var", "obsm", "varm", "obsp", "varp", "uns"]
    with h5py.File(pth) as f:
        adata = ad.AnnData(**{k: read_elem(f[k]) for k in attrs})
    return adata


def concat_on_disk(input_pths: list[Path], output_pth: Path):
    """
    Params
    ------
    input_pths
        Paths to h5ad files which will be concatenated
    output_pth
        File to write as a result
    """
    # Concatenate the annotations in memory and write them out first
    annotations = ad.concat([read_everything_but_X(pth) for pth in input_pths])
    annotations.write_h5ad(output_pth)
    n_variables = annotations.shape[1]
    del annotations

    with h5py.File(output_pth, "a") as target:
        # Seed X with an empty CSR matrix, then append each source's X to it
        dummy_X = sparse.csr_matrix((0, n_variables), dtype="float64")
        dummy_X.indptr = dummy_X.indptr.astype("int64")  # Guarding against overflow for very large datasets
        write_elem(target, "X", dummy_X)

        mtx = SparseDataset(target["X"])
        for p in input_pths:  # fixed: the original looped over an undefined name `pths`
            with h5py.File(p, "r") as src:
                mtx.append(SparseDataset(src["X"]))
```
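For intuition, the append above amounts to stitching the raw CSR arrays together: `data` and `indices` are simply concatenated, while each block's `indptr` is shifted by the running non-zero count. A minimal in-memory sketch of that logic (my illustration for this thread, not the on-disk implementation):

```python
import numpy as np
from scipy import sparse


def append_csr(parts):
    """Row-wise concatenation of CSR matrices by stitching their raw arrays,
    mirroring what an on-disk append does: data and indices are concatenated,
    and each block's indptr is offset by the cumulative nnz so far."""
    data = np.concatenate([p.data for p in parts])
    indices = np.concatenate([p.indices for p in parts])
    indptr_blocks = [np.array([0], dtype="int64")]
    offset = 0
    for p in parts:
        # Cast to int64 so the running offset cannot overflow int32
        indptr_blocks.append(p.indptr[1:].astype("int64") + offset)
        offset += p.nnz
    indptr = np.concatenate(indptr_blocks)
    n_rows = sum(p.shape[0] for p in parts)
    n_cols = parts[0].shape[1]
    return sparse.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))
```

The int64 cast in the loop is the same precaution the `dummy_X.indptr` line above takes: once the total nnz crosses 2**31 - 1, int32 index arrays silently wrap around.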
Alas, I need `.X`, so the lazy concatenation from `anndata` isn't an option.
I've gotten a chance to look at this, but haven't had a chance to use the exact same file; I used one built with `concat_on_disk`, however. With the new file, I am able to do things like:
```python
adata = ad.read_h5ad(ADATA_PTH, backed="r")
result = adata[adata.obs["experimental_batch"] != "run_22"].to_memory()
```
Or:
```python
half = adata[:5_000_000].to_memory()
not_run_19 = half[half.obs["experimental_batch"] != "run_19"].copy()
```
So maybe this works with the new file. Or maybe I'm not hitting the right conditions yet. This vaguely looks like a recent issue in scipy where index dtypes weren't large enough, which I'd like to know we're avoiding: https://github.com/scipy/scipy/issues/19205.
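As a quick sanity check against that kind of index-dtype overflow, one can verify that a matrix's `indptr` dtype can still address its non-zero count. This is a hypothetical diagnostic helper for this thread, not part of scipy or anndata:

```python
import numpy as np
from scipy import sparse


def indptr_is_safe(X) -> bool:
    """Return True if X's indptr dtype can represent its nnz and the final
    indptr entry matches nnz; an int32 indptr that has wrapped around past
    2**31 - 1 non-zeros would fail this check."""
    limit = np.iinfo(X.indptr.dtype).max
    return X.nnz <= limit and int(X.indptr[-1]) == X.nnz
```

On a healthy matrix this returns True; after an overflow the final `indptr` entry no longer equals `nnz`, which is roughly the corruption pattern described in the linked scipy issue.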
@jonas2612 could you try and replicate the original issue with the new file before I try and dig in some more?
Yes, I will try to recreate the file tomorrow. I'll let you know once it's built.
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!
Please make sure these conditions are met
Report
I haven't been able to reproduce the error(s) on a smaller example. The dataset can be downloaded and assembled from https://shendure-web.gs.washington.edu/content/members/cxqiu/public/nobackup/jax/download/adata/. `X` is in CSR format and `experimental_batch` is saved as categorical. I receive two different error messages depending on whether I subset by equality or inequality.

Code:
Traceback:
Code:
Traceback:
Versions