Dimension constraints - Githubissues

ivirshup commented 2 years ago

The issue

A MuData object can hold a collection of AnnData objects that are related in a diverse set of ways. E.g.

- It can hold measurements of multiple variable modalities on a shared group of observations
- Each AnnData could have the same set of variables, but be from a different set of observations
- The observations and variables could be completely disjoint sets

Would it be useful to be able to specify constraints on these dimensions for a particular MuData object? E.g.

- Contains multimodal measurements on a shared set of cells. var_names must not overlap between AnnData objects and obs_names must have a non-empty intersection (or even be exactly the same)
- Each AnnData is RNA seq on a different set of cells – obs_names may not overlap, and var_names must be contained in a fixed annotation set
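
A minimal sketch of how the two example constraints above could be checked, assuming each dataset is reduced to its observation and variable names (in practice these would come from adata.obs_names and adata.var_names). The function names check_shared_obs and check_disjoint_obs are hypothetical and not part of MuData:

```python
from itertools import combinations


def check_shared_obs(mods):
    """Multimodal case: var_names must not overlap, obs_names must intersect.

    `mods` maps a modality name to an (obs_names, var_names) pair.
    Returns a list of human-readable violations (empty if the constraint holds).
    """
    problems = []
    for (a, (_, var_a)), (b, (_, var_b)) in combinations(mods.items(), 2):
        shared = set(var_a) & set(var_b)
        if shared:
            problems.append(f"{a} and {b} share variables: {sorted(shared)}")
    obs_sets = [set(obs) for obs, _ in mods.values()]
    if obs_sets and not set.intersection(*obs_sets):
        problems.append("no observation is shared by all modalities")
    return problems


def check_disjoint_obs(mods, annotation):
    """Multi-dataset case: obs_names must not overlap, var_names must lie in `annotation`."""
    problems = []
    for (a, (obs_a, _)), (b, (obs_b, _)) in combinations(mods.items(), 2):
        shared = set(obs_a) & set(obs_b)
        if shared:
            problems.append(f"{a} and {b} share observations: {sorted(shared)}")
    for name, (_, var) in mods.items():
        unknown = set(var) - set(annotation)
        if unknown:
            problems.append(f"{name} has variables outside the annotation: {sorted(unknown)}")
    return problems


# Made-up example data: two modalities measured on partly shared cells.
multimodal = {
    "rna": (["c1", "c2", "c3"], ["GeneA", "GeneB"]),
    "prot": (["c2", "c3", "c4"], ["CD3", "CD4"]),
}
print(check_shared_obs(multimodal))  # no violations expected
```

Returning a list of violations rather than raising keeps the check usable both for validation at construction time and for simply asking which kind of collection an object is.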

Why would this be useful

I think this would be useful for both speed and semantics.

Semantics

- A method may require modalities matched on a fixed set of cells. If you have repeating modalities or disjoint sets of cells, this is invalid.
- I don't need to check and infer that a MuData object represents a single set of observations on multiple modalities by checking the values in the dimensions. I can just check the constraints.
- Check out references below for more "semantics" use-cases.

Speed

If I know that variables cannot overlap, I need to check fewer places for them. If I know that label sets cannot overlap, and I know each set of labels contains only unique values, I do not need to check for overlaps at all.
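
To illustrate the speed argument: if every label set holds unique values, disjointness reduces to comparing sizes, and a label can then be resolved with a single dictionary lookup instead of probing every dataset. A small sketch with made-up labels; labels_are_disjoint and build_label_index are hypothetical helpers:

```python
def labels_are_disjoint(label_sets):
    """Assuming each label set is already unique, the sets are pairwise
    disjoint exactly when the union is as large as the sizes summed."""
    total = sum(len(labels) for labels in label_sets.values())
    union = set().union(*label_sets.values())
    return len(union) == total


def build_label_index(label_sets):
    """Map each label to the single dataset that holds it.

    Only valid when the label sets are disjoint; otherwise a label
    could belong to several datasets and one lookup would not suffice.
    """
    if not labels_are_disjoint(label_sets):
        raise ValueError("label sets overlap; a one-to-one index is not possible")
    return {label: name for name, labels in label_sets.items() for label in labels}


var_names = {"rna": ["GeneA", "GeneB"], "prot": ["CD3", "CD4"]}
index = build_label_index(var_names)
print(index["CD3"])  # prot
```

If the constraint were declared on the object, the disjointness check itself could be skipped and only the index would need to be built.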

Describe alternatives you've considered

How do we know our invariants hold? I think we'd need to be sure we control the operations that could make them fail. For example, we would somehow need to restrict assignment of .obs_names or .var_names to individual AnnData objects contained in a MuData.
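
One way to picture this control is a container that owns the names and re-validates on every assignment. The class below is a hypothetical stand-in, not MuData's API, and it only enforces the "observations must not overlap" constraint:

```python
class DisjointObsCollection:
    """Toy container: dataset name -> list of observation names.

    Names can only be changed through `set_obs_names`, so the
    "no shared observations" invariant is checked at every write.
    """

    def __init__(self, obs_names_by_dataset):
        self._obs = {}
        for name, obs_names in obs_names_by_dataset.items():
            self.set_obs_names(name, obs_names)

    def set_obs_names(self, name, obs_names):
        new = list(obs_names)
        if len(set(new)) != len(new):
            raise ValueError(f"obs_names of {name!r} are not unique")
        for other, existing in self._obs.items():
            if other != name and set(existing) & set(new):
                raise ValueError(f"{name!r} would overlap with {other!r}")
        self._obs[name] = new

    def obs_names(self, name):
        # Return a tuple so callers cannot mutate the stored names in place.
        return tuple(self._obs[name])


coll = DisjointObsCollection({"batch1": ["c1", "c2"], "batch2": ["c3", "c4"]})
try:
    coll.set_obs_names("batch2", ["c2", "c5"])  # "c2" already lives in batch1
except ValueError as err:
    print(err)
```

With real AnnData objects the hard part is that the contained objects can still be modified directly, which is exactly the restriction problem described above.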

It could also be that cases with constraints are worth totally separate classes.

Additional context

This stems somewhat from a conversation with @ambrosecarr around common data models for single cell data. The idea being that a collection holding a "raw" and "filtered" dataset would be pretty close in functionality to holding multiple modalities.

However, subsets of one set of variables are semantically quite different from disjoint sets of variables. I think it could be quite useful to know which case one is dealing with.

(Transposing the dimensions here – since AnnData is symmetric like that – we get the cases of overlapping and disjoint sets of observations)

Multiple imaging data libraries define container types for collections of images:

- Each image in OME-NGFF gets a group at the root, and tables are specific to a group
  - This might be different for high-throughput plates, not totally sure
- FISHscale uses a MultiDataset object
- PathML uses a SlideDataset for multiple slides

These are multiple objects with disjoint sets of observations.

A while ago now (dask-distributed) I talked about related topics with @joshmoore. It looks like there may be similar discussions around xarray and ome-ngff:

- https://github.com/pydata/xarray/issues/4118 (good discussion on various use cases, has spawned some repos and grants)
- https://github.com/zarr-developers/zarr-specs/issues/125
- It's mentioned quite a bit for OME-NGFF, but I haven't picked out a great exemplar issue yet.

gtca commented 2 years ago

First of all, thanks for opening this issue – and for the information provided!

A 2D data structure design is something I've been experimenting with but at this point I am not sure how to properly address this[^1].

As a first step, I would consider making it possible to create a MuData object with a different axis along which AnnData objects are combined (i.e. now it's axis=1). The current implementation of MuData should support it; however, we have removed this possibility from the interface for the moment.

This would cover the first two points from your list, and that's what I would concentrate on having implemented.

That being said, one can create a MuData object with any AnnData objects, the question is what to do with it then...

Implementation-wise, constraints sound like an appropriate way to address it, and conceptually this sounds similar to the constraints AnnData has for its aligned objects, which is probably a good thing. I am not sure what it would look like; I would start by adding an axis= argument to MuData object construction that would also set a respective property of the MuData. It might be that this is going to be enough.

Another important question, though, is how to work with such objects. E.g. muon.tl.mofa() can take sample (cell) grouping into account, in which case we would just need to check the axis; other methods, however, might be essentially undefined for this problem, e.g. WNN. Considering that the main focus of muon is multimodal datasets, it shouldn't be an issue if methods like this do not support unimodal MuData objects.
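
A rough sketch of such an axis check, assuming only that the object exposes an integer axis property as proposed above. require_axis, the method names, and the stand-in object are hypothetical; which value denotes which layout is left to the caller here:

```python
from types import SimpleNamespace


def require_axis(mdata, allowed, method_name):
    """Raise early if a method is not defined for this object's axis."""
    axis = getattr(mdata, "axis", None)
    if axis not in allowed:
        raise ValueError(
            f"{method_name} is not defined for axis={axis!r}; "
            f"expected one of {sorted(allowed)}"
        )
    return axis


# Stand-in for a MuData-like object; only the `axis` attribute matters here.
fake_mdata = SimpleNamespace(axis=1)
require_axis(fake_mdata, allowed={0, 1}, method_name="grouped_method")  # passes
try:
    require_axis(fake_mdata, allowed={0}, method_name="multimodal_only_method")
except ValueError as err:
    print(err)
```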

[^1]: I can see an extended MuData specification which can contain MuData objects as modalities. In fact, technically, one can already do that! It might be that this is going to be a much more straightforward design than complicating the MuData specification significantly, taking the rest of the ecosystem into consideration.

gtca commented 2 years ago

Just to keep in touch about this enhancement, @ivirshup, there's an axis attribute that has been added in 422d5c9d4a4e8554850c8df53f5d8cf5e28f6d52. This will allow us to build an interface on top of it.

```python
import numpy as np
from mudata import AnnData, MuData

adata = AnnData(np.random.normal(size=(20, 10)))
mdata = MuData({"dataset1": adata, "dataset2": adata.copy()}, axis=1)
mdata.obs_names_make_unique()
mdata.shape
# => (40, 20)
```

grst commented 1 year ago

I've just been reviewing the new axes convention tutorial and it made me wonder if the axis attribute could ever require the use of nested mudata objects. I.e. that instead of holding only AnnData objects, a MuData object could hold another MuData object.

Let's consider the following case:

- RNA-seq and protein data with shared cells, but not variables (axis=0)
- Of the RNA-seq data, there are multiple views, e.g. different preprocessing (axis=-1)

Even though I haven't had a strong need to do this in practice yet, I think it could make sense to represent this as follows:

```
root MuData (axis=0) (5000 x 20050)
├── protein AnnData (5000 x 50)
└── rna MuData (axis=-1)
    ├── raw AnnData (5000 x 20000)
    ├── qc'ed AnnData (3000 x 20000)
    └── hvg-filtered AnnData (3000 x 4000)
```

gtca commented 1 year ago

I believe this has already been functional — I updated the text representation a bit. A clearer hierarchy in the representation as you described, @grst, could also be a good first issue!
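
As a starting point for that, a hierarchical text representation could be produced by a small recursive function. The sketch below works on plain nested dictionaries as stand-ins (a dict is a container, a tuple is a leaf shape) rather than real MuData objects, and render_tree is a hypothetical name:

```python
def render_tree(name, node, prefix="", is_last=True, is_root=True):
    """Return lines of a box-drawing tree for nested dicts of shapes."""
    connector = "" if is_root else ("└── " if is_last else "├── ")
    if isinstance(node, dict):
        lines = [f"{prefix}{connector}{name} (container)"]
        # Children of the root are not indented; deeper levels extend the prefix.
        child_prefix = prefix if is_root else prefix + ("    " if is_last else "│   ")
        children = list(node.items())
        for i, (child_name, child) in enumerate(children):
            lines += render_tree(
                child_name, child, child_prefix, i == len(children) - 1, False
            )
        return lines
    n_obs, n_var = node
    return [f"{prefix}{connector}{name} ({n_obs} x {n_var})"]


tree = {
    "protein": (5000, 50),
    "rna": {"raw": (5000, 20000), "hvg-filtered": (3000, 4000)},
}
print("\n".join(render_tree("root", tree)))
```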

scverse / mudata

Dimension constraints #13