openpipelines-bio / openpipeline

https://openpipelines.bio
MIT License

Only read in part of MuData file. #326

Open DriesSchaumont opened 1 year ago

DriesSchaumont commented 1 year ago

https://mudata.readthedocs.io/en/latest/api/generated/mudata.read.html

DriesSchaumont commented 1 year ago

I was wondering what happens to .obs and .var when an AnnData object is saved to a modality of an existing MuData file. It seems they get updated:

>>> import pandas as pd
>>> import mudata
>>> from anndata import AnnData
>>> 
>>> def test_mudata():
...     df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["obs1", "obs2"], columns=["var1", "var2", "var3"])
...     obs = pd.DataFrame([["A", "sample1"], ["B", "sample2"]], index=df.index, columns=["Obs", "sample_id"])
...     var = pd.DataFrame([["a", "sample1"], ["b", "sample2"], ["c", "sample1"]],
...                        index=df.columns, columns=["Feat", "sample_id_var"])
...     obsm = {"obsm_key": pd.DataFrame([["foo", "bar"], ["lorem", "ipsum"]],
...                                      index=obs.index, columns=["obsm_col1", "obsm_col2"])}
...     ad1 = AnnData(df, obs=obs, var=var, obsm=obsm)
...     var2 = pd.DataFrame(["d", "e", "g"], index=df.columns, columns=["Feat"])
...     obs2 = pd.DataFrame(["C", "D"], index=df.index, columns=["Obs"])
...     ad2 = AnnData(df, obs=obs2, var=var2)
...     return mudata.MuData({'mod1': ad1, 'mod2': ad2})
... 
>>> 
>>> test_data = test_mudata()
/home/di/code/openpipeline/.venv/lib/python3.10/site-packages/mudata/_core/mudata.py:491: UserWarning: Cannot join columns with the same name because var_names are intersecting.
  warnings.warn(
>>> test_data.write_h5mu("test.h5mu")
/home/di/code/openpipeline/.venv/lib/python3.10/site-packages/mudata/_core/mudata.py:491: UserWarning: Cannot join columns with the same name because var_names are intersecting.
  warnings.warn(
>>> 
>>> test_getting_modality = mudata.read("test.h5mu/mod1")
>>> test_getting_modality.obs["test"] = pd.Series(["pekkie", "flip"], name="test_col", index=pd.Index(["obs1", "obs2"]))
>>> mudata.write_h5ad("test.h5mu", mod="mod1", data=test_getting_modality)
>>> 
>>> test_result_of_alteration = mudata.read("test.h5mu")
/home/di/code/openpipeline/.venv/lib/python3.10/site-packages/mudata/_core/mudata.py:491: UserWarning: Cannot join columns with the same name because var_names are intersecting.
  warnings.warn(
>>> test_result_of_alteration.obs
     mod1:Obs mod1:sample_id mod1:test mod2:Obs
obs1        A        sample1    pekkie        C
obs2        B        sample2      flip        D
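
The session above could be wrapped in a small helper, sketched here under a few assumptions: the names `modality_path` and `update_modality` are hypothetical and not part of mudata or openpipeline, and only the two calls already used in the session (`mudata.read` with a `file.h5mu/modality` path and `mudata.write_h5ad` with `mod=` and `data=`) are relied upon.

```python
def modality_path(h5mu_path, modality):
    """Build the 'file.h5mu/modality' path form passed to mudata.read above."""
    return f"{h5mu_path}/{modality}"


def update_modality(h5mu_path, modality, transform):
    """Read a single modality, modify it in place, and write it back.

    `transform` is a hypothetical callback that receives the AnnData object
    of the requested modality and mutates it (e.g. adds an .obs column).
    """
    # Imported lazily so that modality_path stays usable without mudata.
    import mudata

    adata = mudata.read(modality_path(h5mu_path, modality))
    transform(adata)
    # Only the selected modality is written back into the existing file.
    mudata.write_h5ad(h5mu_path, mod=modality, data=adata)
    return adata
```

For instance, `update_modality("test.h5mu", "mod1", lambda ad: ad.obs.assign)` would not change anything, whereas a callback that sets `ad.obs["test"]` should reproduce the `mod1:test` column shown in the output above.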

DriesSchaumont commented 1 year ago

One caveat is that the compression of the output file cannot be changed without reading in the whole file. This would render the --compression arguments useless. As an alternative, could a dedicated compression component be implemented?
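
A rough sketch of what such a compression component might do, assuming it simply reads the whole file and rewrites it with the requested codec. The function names `normalize_compression` and `recompress_h5mu` are hypothetical, the accepted codec list is an assumption, and the `compression` keyword is assumed to be forwarded by `write_h5mu` to the underlying HDF5 dataset creation.

```python
def normalize_compression(value):
    """Map a command-line style value to a codec name, or None for no compression."""
    allowed = {"gzip", "lzf"}  # assumption: the codecs the component would accept
    if value is None or str(value).lower() in ("", "none"):
        return None
    codec = str(value).lower()
    if codec not in allowed:
        raise ValueError(f"Unsupported compression: {value!r}")
    return codec


def recompress_h5mu(input_path, output_path, compression="gzip"):
    """Read a full .h5mu file and write it again with another compression."""
    # Imported lazily so the argument handling above has no dependencies.
    import mudata

    codec = normalize_compression(compression)
    mdata = mudata.read_h5mu(input_path)
    # Only pass the keyword when a codec was requested.
    kwargs = {} if codec is None else {"compression": codec}
    mdata.write_h5mu(output_path, **kwargs)
```

With a component like this, the other components could write modalities in place without a --compression argument, and compression would be applied once as a separate step.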