scverse / muon

muon is a multimodal omics Python framework
https://muon.scverse.org/
BSD 3-Clause "New" or "Revised" License
218 stars 31 forks source link

Reading a backed AnnData from the .h5mu fails #17

Closed gtca closed 3 years ago

gtca commented 3 years ago

Expected to be able to read a modality in a backed mode but it currently fails:

mu.read('pbmc10k.h5mu/atac', backed=True)
# or 
mu.read_h5ad('pbmc10k.h5mu', mod='atac', backed=True)
# => TypeError: __init__() got an unexpected keyword argument 'mode'

It seems like mu.read_h5ad() passes {mode: hdf5_mode} to the AnnData constructor, which can't handle it.

ilia-kats commented 3 years ago

I can fix reading backed data relatively easily, but there is a conceptual question on how to deal with backed data in general. Imagine we have a backed MuData mu and we do this:

ad = mu.mod["rna"]
ad.filename = "some_other_file.h5ad"

This would currently error out, the question is what to do with this. For a normal backed AnnData object, it would write itself to its backing file and rename the backing file. Here we have two choices:

  1. We write to and rename the h5mu file, which would affect all modalities
  2. We create a new h5ad file and write the RNA modality to the new file. In this case, we would have two backing stores for the same MuData file: The original one, where all modalities are stored, and the new one, where only the RNA modality is stored. Arguably, this does not make sense.

I suppose the expectation with the above code is that the ad variable is backed to the new file without affecting the MuData object? In this case it's also not clear to me how to achieve this. Since ad and mu.mod["rna"] reference the same AnnData object, the filename setter would need to perform some introspection magic to copy the AnnData object and assign the copy to the ad variable, which would probably be very fragile.

The alternative would be to subclass AnnData and overide the methods dealing with file backing, either erroring out or forwarding the actions to the wrapping MuData object (essentially implementing case 1)

gtca commented 3 years ago

Would it make sense (and be possible) to just remove the ability to change the backing file name?..

ilia-kats commented 3 years ago

Several functionalities are affected: Changing the backing filename, the copy method, the writemethod, probably more that I can't think of at the moment. But yes, I guess we could disable all of them. But again, there is a caveat: If we have a mudata object, we can just refer the user to this object. But if we only have an anndata object backed by a h5mu file, as in this issue, then there would be no way at all to perform the above functions. Unless we introduce a special case for this, i.e. an extra subclass of AnnData or something.

And I'm afraid to even think of for example reading backed AnnData objects from two different h5mu files, constructing a new MuData object from these and trying to write that to somewhere. I have no idea at the moment how this would behave.

gtca commented 3 years ago

I think many things are actually returning errors when one tries to call modifying scanpy functions on the backed AnnData. So I would be inclined to think that as long as you can read the data, it's not a top priority to fix failing functionality like changing a backing file name. Of course it would be great if we could give a meaningful error message.

But maybe that's a more commonly used functionality in other people's workflows compared to my experience... Also errors are fine in this case, nasty side effects are problematic of course.

ilia-kats commented 3 years ago

Then I suppose this issue can be closed, reading should work now, everything else just errors out.

gtca commented 3 years ago

I think it fails if the .h5mu file doesn't have .raw slots in its modalities:

>>> pytest tests/test_io.py
AttributeError: Could not find dataset for raw X
ilia-kats commented 3 years ago

that should be fixed in the latest master (didn't reference this issue in that commit, sorry)

gtca commented 3 years ago

Ah, true, thanks a lot!

I'll close it for now then. And we can get back to this when/if we decide these challenges have to be addressed.

ilia-kats commented 3 years ago

Just had another idea for the API: What if we separate the syntax mu["rna"] from mu.mod["rna"]? The former would return an AnnData object that is coupled to the h5mu file, so no independent writes/copies, while the latter would return a decoupled object, that would detach itself from the h5mu file upon writing/copying (some subclass or wrapper around AnnData should do the trick, I think). Then we could also introduce a parameter to read_h5ad to indicate which kind of AnnData one wants. That way, everything would be explicit and documented and we would provide all the functionality.

Should we move this discussion to a new issue maybe?