Closed gtca closed 3 years ago
I can fix reading backed data relatively easily, but there is a conceptual question on how to deal with backed data in general. Imagine we have a backed MuData mu and we do this:
ad = mu.mod["rna"]
ad.filename = "some_other_file.h5ad"
This would currently error out, the question is what to do with this. For a normal backed AnnData object, it would write itself to its backing file and rename the backing file. Here we have two choices:
I suppose the expectation with the above code is that the ad variable is backed to the new file without affecting the MuData object? In this case it's also not clear to me how to achieve this. Since ad
and mu.mod["rna"]
reference the same AnnData object, the filename
setter would need to perform some introspection magic to copy the AnnData object and assign the copy to the ad
variable, which would probably be very fragile.
The alternative would be to subclass AnnData and overide the methods dealing with file backing, either erroring out or forwarding the actions to the wrapping MuData object (essentially implementing case 1)
Would it make sense (and be possible) to just remove the ability to change the backing file name?..
Several functionalities are affected: Changing the backing filename, the copy
method, the write
method, probably more that I can't think of at the moment. But yes, I guess we could disable all of them. But again, there is a caveat: If we have a mudata object, we can just refer the user to this object. But if we only have an anndata object backed by a h5mu file, as in this issue, then there would be no way at all to perform the above functions. Unless we introduce a special case for this, i.e. an extra subclass of AnnData or something.
And I'm afraid to even think of for example reading backed AnnData objects from two different h5mu files, constructing a new MuData object from these and trying to write that to somewhere. I have no idea at the moment how this would behave.
I think many things are actually returning errors when one tries to call modifying scanpy functions on the backed AnnData. So I would be inclined to think that as long as you can read the data, it's not a top priority to fix failing functionality like changing a backing file name. Of course it would be great if we could give a meaningful error message.
But maybe that's a more commonly used functionality in other people's workflows compared to my experience... Also errors are fine in this case, nasty side effects are problematic of course.
Then I suppose this issue can be closed, reading should work now, everything else just errors out.
I think it fails if the .h5mu file doesn't have .raw
slots in its modalities:
>>> pytest tests/test_io.py
AttributeError: Could not find dataset for raw X
that should be fixed in the latest master (didn't reference this issue in that commit, sorry)
Ah, true, thanks a lot!
I'll close it for now then. And we can get back to this when/if we decide these challenges have to be addressed.
Just had another idea for the API: What if we separate the syntax mu["rna"]
from mu.mod["rna"]
? The former would return an AnnData object that is coupled to the h5mu file, so no independent writes/copies, while the latter would return a decoupled object, that would detach itself from the h5mu file upon writing/copying (some subclass or wrapper around AnnData should do the trick, I think). Then we could also introduce a parameter to read_h5ad
to indicate which kind of AnnData one wants. That way, everything would be explicit and documented and we would provide all the functionality.
Should we move this discussion to a new issue maybe?
Expected to be able to read a modality in a backed mode but it currently fails:
It seems like
mu.read_h5ad()
passes{mode: hdf5_mode}
to the AnnData constructor, which can't handle it.