scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

Selective read/write mudata modalities #63

Open racng opened 11 months ago

racng commented 11 months ago

Is your feature request related to a problem? Please describe.

Reading and writing MuData is sometimes slow. For example, after some TCR sequence analyses, the MuData object takes longer to read and write. Sometimes I add a single annotation to mdata.obs, and saving then requires writing all modalities again. I appreciate that one specific modality can be read or written with a path like mdata.h5mu/rna, but there is no option to read and write only the non-modality elements such as mdata.obs, mdata.var, mdata.obsm, etc. I imagine this could save time in several use cases.

Describe the solution you'd like

The ability to specify a list of modalities to read or write, with the option to pass an empty list so that only the non-modality elements of mdata are read or written. This could be implemented as an extra argument to the existing MuData IO functions.

gtca commented 4 months ago

Thank you, @racng, for the detailed use case description!

Ideally we would stay close to anndata's implementation of backed mode, but the interface for what you describe was scrapped there.

Just as in anndata, there's currently a backed mode in mudata that might help:

import mudata

mdata = mudata.read("dataset.h5mu", backed=True)

I can also link related issues that discuss similar challenges in AnnData:

The last one showcases some ongoing work to make the API for reading individual elements public, but it is still a work in progress. I am also not sure whether writing data back to disk is part of that effort.

There's another experimental approach to out-of-memory operations on AnnData/MuData objects that you can try: https://github.com/scverse/shadows. It is not a stable library yet, but hopefully it can work as a drop-in solution for your case.