scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License
72 stars 17 forks source link

Propagate obs #3

Closed mffrank closed 11 months ago

mffrank commented 2 years ago

Implements the features from #2

Tested with this:

import mudata as md
import anndata as ad
import numpy as np

mod1 = ad.AnnData(np.arange(0, 100, 0.1).reshape(-1, 10))
mod2 = ad.AnnData(np.arange(101, 2101, 1).reshape(-1, 20))
mods = {"mod1": mod1, "mod2": mod2}
# Make var_names different in different modalities
for m in ["mod1", "mod2"]:
    mods[m].var_names = [f"{m}_var{i}" for i in range(mods[m].n_vars)]
    mods[m].obs['common'] = 1
mod1.obs['individual1'] = 2
mod2.obs['individual2'] = 3
mdata = md.MuData(mods)

# Now add some global columns
mdata.obs['to_propagate'] = 1
mdata.obs['mod2:individualadded'] = 4
mdata.propagate_obs()

mdata.obs
mdata.mod['mod1'].obs

Has some minimal tests and docstring, please let me know what's missing.

gtca commented 2 years ago

Thanks for going ahead to implement it! I think there is still place for some discussion about how this should behave. Let's consider different scenarios.

simple global column

mdata.obs["simple"] = np.arange(mdata.n_obs)
mdata.propagate_obs()
mdata.obs.columns  
# => Index(['rna:simple', 'atac:simple'], dtype='object')
mdata.obs.head(2)
#  =>                 rna:simple   atac:simple
# AAACAGCCAAATATCC-1           0             0
# AAACAGCCAGGAACTG-1           1             1

The column is successfully propagated but since it is now present in individual modalities, mdata.obs["simple"] is removed and modname:simple columns are added from modalities upon the update. Yes, this would allow to have modality.obs["simple"] but this removes mdata.obs["simple"], which did not have modality specificity before. Depends on the use case... but in general this does not seem ideal to me.

It might be that we consider changing .update() to include comparisons to decide if modname:simple can be collapsed to simple but I think we would like to avoid that (deterministic behaviour, speed, etc.).

Maybe the simplest way here is to keep propagated columns such as simple as global columns as well, or at least to have this as an option.

modality-specific change

mdata.obs["rna:simple"] = "x"
mdata.propagate_obs()

mdata.obs.head(2)
#  =>                 rna:simple   atac:simple
# AAACAGCCAAATATCC-1           0             0
# AAACAGCCAGGAACTG-1           1             1

While the doc currently indicates that the function will overwrite existing columns with the same name in individual modalities, it doesn't seem to be the case here. Is this intended?

It will overwrite the "simple" column when it's added to mdata.obs without modname though:

mdata.obs["simple"] = "x"
mdata.propagate_obs()
mdata.obs.head(2)
# =>                 rna:simple atac:simple
# AAACAGCCAAATATCC-1          x           x
# AAACAGCCAGGAACTG-1          x           x

Let me know what you think!

gtca commented 11 months ago

This is closed in favour of https://github.com/scverse/mudata/pull/57.