scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

Dimension constraints #13

Closed ivirshup closed 1 year ago

ivirshup commented 2 years ago

The issue

A MuData object can hold a collection of AnnData objects that are related in a diverse set of ways. E.g.

Would it be useful to be able to specify constraints on these dimensions for a particular MuData object? E.g.

Why would this be useful

I think this would be useful for both speed and semantics.

Semantics

Speed

If I know that variables cannot overlap, I need to check fewer places for them. If I know that label sets cannot overlap, and I know each set of labels contains only unique values, I do not need to check for overlaps at all.
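A minimal sketch of the speed argument, in plain Python rather than the mudata API. The function name and its `disjoint` flag are hypothetical, and variable names are modeled as plain sets keyed by modality:

```python
from typing import Dict, List, Set


def modalities_containing(
    name: str, var_names: Dict[str, Set[str]], disjoint: bool = False
) -> List[str]:
    """Return the keys of all modalities whose variable names include `name`.

    `disjoint=True` is a hypothetical constraint flag: it asserts that no
    variable name appears in more than one modality, so the search can stop
    at the first hit.
    """
    hits = []
    for key, names in var_names.items():
        if name in names:
            hits.append(key)
            if disjoint:
                break  # the constraint rules out further matches
    return hits


# "raw"/"filtered": one set of variables is a subset of the other.
overlapping = {"raw": {"g1", "g2", "g3"}, "filtered": {"g1", "g3"}}
# "rna"/"protein": the sets of variables are disjoint.
separate = {"rna": {"g1", "g2"}, "protein": {"p1", "p2"}}

# Without the constraint, every modality has to be checked.
print(modalities_containing("g1", overlapping))
# With the constraint, the loop exits as soon as one modality matches.
print(modalities_containing("p1", separate, disjoint=True))
```

The early exit is only safe if the constraint is actually enforced, which is the question raised in the next section.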

Describe alternatives you've considered

How do we know our invariants hold? I think we'd need to be sure we control the operations that could make them fail. For example, we would somehow need to restrict assignment of .obs_names or .var_names to individual AnnData objects contained in a MuData.
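To illustrate what controlling assignment could mean, here is a rough sketch in plain Python. `DisjointNames` and its methods are hypothetical and not part of mudata or anndata; the point is only that all writes go through one method that checks the invariant, and reads return an immutable copy:

```python
from typing import Dict, FrozenSet, Iterable, Set


class DisjointNames:
    """Hypothetical container that keeps per-modality name sets disjoint."""

    def __init__(self) -> None:
        self._names: Dict[str, Set[str]] = {}

    def set_names(self, key: str, names: Iterable[str]) -> None:
        """Assign names for one modality, rejecting overlap with any other."""
        new = set(names)
        for other, existing in self._names.items():
            if other != key and new & existing:
                clash = sorted(new & existing)
                raise ValueError(f"{key!r} overlaps with {other!r}: {clash}")
        self._names[key] = new

    def names(self, key: str) -> FrozenSet[str]:
        """Return an immutable copy so callers cannot bypass the check."""
        return frozenset(self._names[key])


c = DisjointNames()
c.set_names("rna", ["g1", "g2"])
c.set_names("protein", ["p1"])
try:
    c.set_names("protein", ["g1"])  # would break the disjointness invariant
except ValueError as err:
    print(err)
```

A rejected assignment leaves the previous names in place, so the invariant holds before and after every call.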

It could also be that cases with constraints are worth totally separate classes.

Additional context

This stems somewhat from a conversation with @ambrosecarr around common data models for single cell data. The idea being that a collection holding a "raw" and "filtered" dataset would be pretty close in functionality to holding multiple modalities.

However, a subset of one set of variables is semantically quite different from disjoint sets of variables. I think it could be quite useful to know which case one was dealing with.

(Transposing the dimensions here – since AnnData is symmetric like that – we get the cases of overlapping and disjoint sets of observations)


Multiple imaging data libraries define container types for collections of images:

These are multiple objects with disjoint sets of observations.


A while ago now (dask-distributed) I talked about related topics with @joshmoore. It looks like there may be similar discussions around xarray and ome-ngff:

gtca commented 2 years ago

First of all, thanks for opening this issue – and for the information provided!

A 2D data structure design is something I've been experimenting with but at this point I am not sure how to properly address this[^1].

As a first step, I would consider making it possible to create a MuData object with a different axis along which AnnData objects are combined (i.e. now it's axis=1). The current implementation of MuData should support this, although we have removed this option from the interface for the moment.

This would cover the first two points from your list, and that's what I would concentrate on having implemented.

That being said, one can create a MuData object with any AnnData objects, the question is what to do with it then...


Implementation-wise, constraints sound like an appropriate way to address it, and conceptually this is similar to the constraints AnnData has for its aligned objects, which is probably a good thing. I am not sure what it would look like; I would start by adding an axis= argument to MuData object construction that would also set a corresponding property of the MuData. It might be that this is going to be enough.

Another important question, though, is how to work with such objects. E.g. muon.tl.mofa() can take sample (cell) grouping into account, in which case we would just need to check the axis; other methods, however, might be essentially undefined for this problem, e.g. WNN. Considering that the main focus of muon is multimodal datasets, it shouldn't be an issue if methods like this do not support unimodal MuData objects.
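As a rough illustration of "checking for the axis" inside a method, the sketch below uses a hypothetical `Collection` stand-in and a hypothetical `require_axis` guard; neither is mudata or muon API. The axis numbers are placeholders and do not assert which value mudata assigns to which layout:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple


@dataclass
class Collection:
    """Hypothetical stand-in for a MuData-like container with an `axis`."""

    shapes: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    axis: int = 0  # placeholder value, set once at construction


def require_axis(data: Collection, supported: FrozenSet[int], method: str) -> None:
    """Raise if `method` is not defined for the container's axis."""
    if data.axis not in supported:
        raise NotImplementedError(
            f"{method} is not defined for axis={data.axis}; "
            f"supported: {sorted(supported)}"
        )


def grouping_aware_method(data: Collection) -> str:
    # A method that can handle either layout only needs to branch on the axis.
    require_axis(data, frozenset({0, 1}), "grouping_aware_method")
    return f"ran with axis={data.axis}"


def multimodal_only_method(data: Collection) -> str:
    # A method that is only meaningful for one layout refuses the other.
    require_axis(data, frozenset({0}), "multimodal_only_method")
    return "ran"
```

Declaring the supported axes per method keeps unsupported combinations explicit instead of letting them fail somewhere deeper in the computation.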

[^1]: I can see an extended MuData specification which can contain MuData objects as modalities. In fact, technically, one can already do that! It might be that this is going to be a much more straightforward design than complicating the MuData specification significantly, taking the rest of the ecosystem into consideration.

gtca commented 2 years ago

Just to keep in touch about this enhancement, @ivirshup, there's an axis attribute that has been added in 422d5c9d4a4e8554850c8df53f5d8cf5e28f6d52. This will allow us to build an interface on top of it.

```python
import numpy as np
from mudata import AnnData, MuData

adata = AnnData(np.random.normal(size=(20, 10)))
mdata = MuData({"dataset1": adata, "dataset2": adata.copy()}, axis=1)
mdata.obs_names_make_unique()
mdata.shape
# => (40, 20)
```

grst commented 1 year ago

I've just been reviewing the new axes convention tutorial and it made me wonder if the axis attribute could ever require the use of nested MuData objects, i.e. that instead of holding only AnnData objects, a MuData object could hold another MuData object.

Let's consider the following case:

Even though I haven't had a strong desire to do this in practice yet, I think it could make sense to represent this as follows:

```
root MuData (axis=0) (5000 x 20050)
├── protein AnnData (5000 x 50)
└── rna MuData (axis=-1)
    ├── raw AnnData (5000 x 20000)
    ├── qc'ed AnnData (3000 x 20000)
    └── hvg-filtered AnnData (3000 x 4000)
```

gtca commented 1 year ago

I believe this has already been functional — I updated the text representation a bit. A clearer hierarchy in the representation as you described, @grst, could also be a good first issue!
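For anyone picking up the hierarchy idea, here is one way such a text representation could be sketched. `Node` is a hypothetical stand-in for AnnData and MuData objects, not the actual mudata representation code, and the shapes are simply the ones from the example above:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in: a leaf is AnnData-like, a node with children is MuData-like."""

    name: str
    kind: str
    shape: Optional[Tuple[int, int]] = None
    axis: Optional[int] = None
    children: List["Node"] = field(default_factory=list)

    def label(self) -> str:
        parts = [self.name, self.kind]
        if self.axis is not None:
            parts.append(f"(axis={self.axis})")
        if self.shape is not None:
            parts.append(f"({self.shape[0]} x {self.shape[1]})")
        return " ".join(parts)


def render(node: Node, prefix: str = "", is_root: bool = True, is_last: bool = True) -> List[str]:
    """Return one line per node, drawing the nesting with box characters."""
    if is_root:
        lines = [node.label()]
        child_prefix = ""
    else:
        lines = [prefix + ("└── " if is_last else "├── ") + node.label()]
        child_prefix = prefix + ("    " if is_last else "│   ")
    for i, child in enumerate(node.children):
        lines.extend(render(child, child_prefix, False, i == len(node.children) - 1))
    return lines


tree = Node("root", "MuData", shape=(5000, 20050), axis=0, children=[
    Node("protein", "AnnData", shape=(5000, 50)),
    Node("rna", "MuData", axis=-1, children=[
        Node("raw", "AnnData", shape=(5000, 20000)),
        Node("qc'ed", "AnnData", shape=(3000, 20000)),
        Node("hvg-filtered", "AnnData", shape=(3000, 4000)),
    ]),
])
lines = render(tree)
print("\n".join(lines))
```

Because `render` recurses on `children`, the same function handles any nesting depth; a real implementation would read names, shapes, and the axis from the objects themselves.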