scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License
72 stars 16 forks source link

Unexpected behaviour when assigning AnnData object to a MuData object modality #27

Closed VladimirShitov closed 1 year ago

VladimirShitov commented 1 year ago

Describe the bug Imagine that you have Mudata object named mdata. You extract one modality from it to AnnData object adata, do something with it (e.g., subsetting genes) and then trying to store adata in the same modality it came from. It doesn't work as expected though. The n_vars is inconsistent with actual number of variables:

image

It also causes the problem with saving file:

image

Weirdly, after trying to save file and getting error, everything seems to be fixed:

image

To Reproduce Hope, the problem is clean from the screenshot. If you need the reproducible example, please write the comment about it

Expected behaviour Consistent number of variables and saving data from the first try

System

gtca commented 1 year ago

Hey @VladimirShitov,

While I can't actually guarantee the current version of MuData supports reassigning modalities with intentionally defined behaviour, it does actually work and seems to work as expected. The multimodal object contains (1) modalities and (2) multimodal information. Individual modalities can be modified but for MuData to know about this (and to update the multimodal information accordingly), one has to explicitly do this by calling mdata.update() (also see here). Given that, the following works:

from mudata import AnnData, MuData, read_h5mu
import numpy as np
import jax

x = np.array(jax.random.normal(jax.random.PRNGKey(1), (10, 100)))
adata = AnnData(x)
mdata = MuData({"one": adata})
mdata.shape  # => (10, 100)

adata2 = adata[:,:20].copy()
mdata.mod["one"] = adata2
mdata.update()
mdata.shape # => (10, 20)

mdata.write("issue27.h5mu")
read_h5mu("issue27.h5mu").shape # => (10, 20)

When calling .write(), .update() should actually also be called automatically inside it so that unupdated information is not serialised.

With this, I cannot reproduce the issue with serialising mdata — unless the feature names are changed before assignment. If they were, one should get an informative error message (at least on the latest master):

adata2 = adata[:,:20].copy()
adata2.var_names = [f"var{i}" for i in adata2.var_names]
mdata.mod["one"] = adata2
mdata.update()
# => NotImplementedError: var_names seem to have been renamed and filtered at the same time. There is no way to restore the order.

That seems reasonable: with a different number of features and different names for them, MuData has no way of knowing which ones were kept after filtering data.

If one assigned it as another modality instead, it works as expected:

mdata = MuData({"one": adata})
mdata.mod["two"] = adata2
mdata.update()
mdata.shape # => (10, 120)

I hope that makes sense!

VladimirShitov commented 1 year ago

Thank you for the answer! Indeed, feature names have been changed in my case. I solved my problem anyway.

By the way, maybe it makes sense to run .update() automatically every time when assigning something to modality? It can be done using @property and a setter, though I am not sure how to do it with dictionary values.

gtca commented 1 year ago

Thanks @VladimirShitov! It should possible to wrap the dictionary of modalities into a custom class to do that but .update() will still be required in other cases as MuData's attributes can't be updated automatically when AnnData objects change externally. We can definitely look into reducing the amount of .update() calls required when we technically can though (incl. muon calls). Just to note, (re)assigning modalities is not something encouraged by design in the current workflows (e.g. mdata[modality] does not have a setter).

In case there are use cases that you find particularly useful and that would require changes like this to be implemented, feel free to share so that we can discuss them and figure out the best way to address them!