scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

write_zarr method #6

Closed keller-mark closed 1 year ago

keller-mark commented 2 years ago

Is your feature request related to a problem? Please describe.
HDF5 is difficult / impossible to read in JavaScript. AnnData objects have a .write_zarr method. Zarr stores can be easily read in JavaScript with libraries such as https://github.com/gzuidhof/zarr.js/.

Describe the solution you'd like
MuData objects should have an analogous .write_zarr method.

Additional context
https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.write_zarr.html#anndata.AnnData.write_zarr

ivirshup commented 2 years ago

Reference to other issue: https://github.com/vitessce/vitessce/issues/1119

gtca commented 2 years ago

Two related questions:

  1. @ivirshup, should we handle backed data better than the current implementation in anndata does?

    Failed to write value for X, since a writer for type <class 'h5py._hl.dataset.Dataset'> has not been implemented yet.
  2. @ivirshup, I would be inclined to have dynamic dispatch for MuData.write() based on the file extension: e.g. .write_zarr() for .zarr and .write_h5mu() for the rest. Currently adata.write("file.zarr") will write an HDF5 file to file.zarr.
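
For illustration, the extension-based dispatch proposed in point 2 might be sketched as below. This is not mudata's implementation: `pick_writer` is a hypothetical helper, and it only returns the name of the method that a dispatching `.write()` would call, so it runs without mudata installed.

```python
from pathlib import Path


def pick_writer(filename: str) -> str:
    """Return the name of the writer a dispatching .write() would call.

    Hypothetical helper: ".zarr" maps to write_zarr, and every other
    extension falls back to write_h5mu, as proposed in the comment.
    """
    suffix = Path(filename).suffix.lower()
    if suffix == ".zarr":
        return "write_zarr"
    # Fallback: anything that is not ".zarr" is treated as HDF5.
    return "write_h5mu"


print(pick_writer("data.zarr"))  # write_zarr
print(pick_writer("data.h5mu"))  # write_h5mu
print(pick_writer("data.h5"))    # write_h5mu (fallback)
```

Note that the fallback branch means any unrecognised extension is silently written as HDF5.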

ivirshup commented 2 years ago

should we handle backed data better than the current implementation in anndata does?

Could it just be fixed in anndata?

I would be inclined to have dynamic dispatch

I prefer explicit over implicit here. Plus, if it's h5mu for every other extension, what happens when there's a new backend?

adata.write("file.zarr") will write an HDF5 file to file.zarr

The docstring does say that's what it will do. I don't think I would have defined a .write method.

gtca commented 2 years ago

should we handle backed data better than the current implementation in anndata does?

Could it just be fixed in anndata?

Yes, that's what I meant, just wanted to start this discussion here. I just haven't been aware if there are any reasons not to support backed data for .write_zarr().


For .write(), we have control over the backends that anndata/mudata support: there is a default one (HDF5) and others, which currently (and in the near future) means only .zarr. I agree with the explicitness argument, but I'd say the current behaviour is somewhat user-unfriendly: clearly I don't want adata.write("adata.zarr") to write an HDF5 file...

ivirshup commented 2 years ago

I just haven't been aware if there are any reasons not to support backed data for .write_zarr().

I can't think of a reason. Just needs implementing I think.

clearly I don't want adata.write("adata.zarr") to write an HDF5 file...

I would lean towards deprecating .write over putting more logic in it.
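
One way the deprecation route could look, as a rough sketch only: `Container` below is a stand-in class, not AnnData or MuData, and its method names (`write_h5`, `write_zarr`) and warning text are assumptions made for illustration. The generic `write()` keeps its documented HDF5 behaviour but warns, pointing callers at the explicit methods.

```python
import warnings


class Container:
    """Stand-in for a data container; not the real AnnData/MuData class."""

    def __init__(self):
        # Record which writer was invoked instead of touching the disk.
        self.calls = []

    def write_h5(self, filename: str) -> None:
        # Placeholder for an explicit HDF5 writer.
        self.calls.append(("write_h5", filename))

    def write_zarr(self, store: str) -> None:
        # Placeholder for an explicit Zarr writer.
        self.calls.append(("write_zarr", store))

    def write(self, filename: str) -> None:
        # Deprecated generic entry point: no extension logic is added,
        # it simply warns and keeps writing HDF5.
        warnings.warn(
            "write() is deprecated; call write_h5() or write_zarr() explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        self.write_h5(filename)


obj = Container()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    obj.write("file.zarr")

print(obj.calls)                    # [('write_h5', 'file.zarr')]
print(caught[0].category.__name__)  # FutureWarning
```

The warning makes the HDF5 behaviour visible to the caller without adding any extension-based logic to `write()`.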

mruffalo commented 2 years ago

:+1: for the ability to write in Zarr format; thanks for the interest and discussion!

My HuBMAP center is working toward integrative analysis of multiomic data (SNARE-seq, 10X multiome, CITE-seq), and MuData seems like a fantastic data layout. Serialization to Zarr is a significant requirement for visualization in the HuBMAP portal though, hence @keller-mark opening this issue.

Would a PR be welcome to implement MuData.write_zarr?

Side note: is it worth trying to adopt a Zarr directory extension convention that mirrors .h5/.hdf5 → .h5ad and .h5mu, such as .zarr → .zrad and .zrmu, or .zarr-ad and .zarr-mu?

gtca commented 2 years ago

Hey @mruffalo, thanks for chiming in!

There is already a MuData.write_zarr() that should work, thanks to @keller-mark and his pull request https://github.com/scverse/mudata/pull/7. Does it implement the functionality you're looking for?

It makes me realise we should add more about .zarr into our documentation pages!


For the directory naming with custom extensions, I am not sure this is something that is planned; maybe @ivirshup can share more thoughts or details on that.

mruffalo commented 2 years ago

Hey @gtca -- thanks so much, and sorry for missing that! I hadn't seen anything in the muon documentation about writing Zarr, so I hadn't realized this was already implemented. I was also a bit thrown off by the distinction between muon and mudata and which package I should check for this functionality, in addition to this issue still being open.

I see muon.read_zarr and muon.MuData.write_zarr functions/methods, so this looks good! My initial comment was to verify that data in this format would be usable by the HuBMAP UI component, of whom @keller-mark is a member, so it's great to know that all is well.

gtca commented 2 years ago

No worries at all!

Just to clarify a bit about mudata/muon, this is the same as with anndata/scanpy, where most user-facing features are in the frameworks, and the format implementation is intended to be more stable, to have a small number of dependencies and thus also to appeal to developers who build pipelines and tools using the scverse data structures. That being said, we're not fussed which repository initial discussions happen in, especially since it's easy to cross-link issues and discussions.

Looking forward to the new HuBMAP developments! And of course feel free to start another discussion or a PR in case more functionality is required, here or in the AnnData repo.