single-cell-data / SOMA

A flexible and extensible API for annotated 2D matrix data stored in multiple underlying formats.
MIT License

Identify nomenclature that unambiguously describes the components of matrix-api objects #11

Closed ambrosejcarr closed 1 year ago

ambrosejcarr commented 2 years ago

This issue identifies confusion with the sc-group and sc-dataset nomenclature and contains a discussion about potential replacement names and how those names should be propagated to the broader feature-object-matrix standardization efforts.

ivirshup commented 2 years ago

I find the current nomenclature (sc-group :: AnnData, sc-dataset :: MuData) a little confusing.

I'd thought the names were switched, since sc-group sounds more plural than sc-dataset to me. There's also some ambiguity when multiple studies are collected together, since dataset could refer to either. One suggestion would be sc-collection for the plural.

falexwolf commented 2 years ago

I also stumbled across the naming when first reading it in the Google Doc proposal and made a comment. Months later, I personally got used to it and can't think of a better alternative.

I do think it'd be helpful to mention the naming convention translation (sc_group~AnnData~assay~..., sc_dataset~MuData~...) in the docs for those readers familiar with this convention, and introduce these as the most high-level terms first in the specification of the schema.

Here comes a suggestion along these lines: https://github.com/single-cell-data/matrix-api/pull/17

ivirshup commented 2 years ago

I still feel like using group for singular and dataset for plural can be confusing.

Especially when HDF5 and zarr, which one might refer to in the same breath, define Groups as a collection of Datasets. I feel like potential confusion here could be avoided by just picking a different name. (sc-group :: AnnData, super-group :: MuData) would also avoid this. Though there is still the issue of "wait, which kind of 'group' do you mean?".

falexwolf commented 2 years ago

I'd be very much open for a different naming choice, but I also think that would require a more in-depth decision doc.

For instance, I'd feel "super-group" bears more potential for confusion in the future.

ambrosejcarr commented 2 years ago

@ivirshup do you think this would be mitigated in practice when schema specifications + group naming is applied in implementations of the schema? I feel they'll make things more concrete.

I'm open to changing the name, but I think Alex's changes in #17 mitigate most of the confusion and wonder if we're in an area of diminishing returns.

As an example, the TileDB implementation we're working on has the following structure. This example models a 10x multiome.

pbmc_small  # sc_dataset
├── __tiledb_group.tdb
├── RNA   # sc_group
│   ├── __tiledb_group.tdb
│   ├── var
│   ├── obs
│   ├── obsm
│   ├── varm
│   └── X
└── ATAC  # sc_group
    ├── __tiledb_group.tdb
    ├── var
    ├── obs
    ├── obsm
    ├── varm
    └── X

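To make the layout above concrete in code, here is a minimal in-memory sketch. The class names `ScGroup` and `ScDataset`, the field types, and the toy shapes are all hypothetical and chosen only to mirror the directory tree; this is not the TileDB implementation.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ScGroup:
    """Hypothetical stand-in for one sc_group: an annotated matrix."""

    X: np.ndarray  # observations x variables
    obs: dict  # per-observation annotations, each of length n_obs
    var: dict  # per-variable annotations, each of length n_var
    obsm: dict = field(default_factory=dict)  # multi-column per-observation arrays
    varm: dict = field(default_factory=dict)  # multi-column per-variable arrays

    def __post_init__(self):
        n_obs, n_var = self.X.shape
        # Every annotation must be aligned to the axis it describes.
        for axis_name, annotations, expected in (
            ("obs", self.obs, n_obs),
            ("var", self.var, n_var),
            ("obsm", self.obsm, n_obs),
            ("varm", self.varm, n_var),
        ):
            for key, values in annotations.items():
                if len(values) != expected:
                    raise ValueError(
                        f"{axis_name}[{key!r}] has length {len(values)}, expected {expected}"
                    )


@dataclass
class ScDataset:
    """Hypothetical stand-in for an sc_dataset: named, related sc_groups."""

    groups: dict = field(default_factory=dict)


# A toy 10x-multiome-like example: same cells, different feature spaces.
cells = ["cell_0", "cell_1", "cell_2"]
rna = ScGroup(
    X=np.zeros((3, 5)),
    obs={"cell_id": cells},
    var={"gene_id": [f"gene_{i}" for i in range(5)]},
)
atac = ScGroup(
    X=np.zeros((3, 8)),
    obs={"cell_id": cells},
    var={"peak_id": [f"peak_{i}" for i in range(8)]},
)
pbmc_small = ScDataset(groups={"RNA": rna, "ATAC": atac})
```

Each group carries its own `var` axis, which is why the two modalities sit side by side in a container rather than in one matrix.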
ivirshup commented 2 years ago

I think part of the issue is me working with very concrete examples (hdf5 and zarr APIs) where groups contain datasets. The context switch here keeps tripping me up. I'm not sure I've had a day where I've thought about this and not mixed them up at least once.

I don't think this is a huge sticking point for me, but I think we're going to have some confused conversations if the current sc-dataset, sc-group convention stays. If one of these names changes, I think we can avoid that.

falexwolf commented 2 years ago

I understand Isaac's discomfort with the naming choice and have given it more thought to come up with a potentially better choice. To that end, I'll try to offer an additional perspective on the current specification.

What we are discussing here is the naming choice of two "data structures" giving rise to two "data container objects" and their serialized counterparts. Here comes a table with a few naming suggestions. Please note that the new random abbreviation SLAM is just meant to ditch all connotations with existing terms and could be any acronym for that sake.

| Data structure | Possible names |
| --- | --- |
| A simple layered annotated matrix (SLAM) storing measurements of variables. | sc_group, sc_matrix, slam, sla_matrix (AnnData) |
| The overall container structure, a collection of related SLAMs. | sc_dataset, sc_collection, sc_data, omics_data, slam_collection, slam_data, sla_collection (MuData) |

To jump to my conclusion for selecting names: My strongest inclination is that dataset is indeed too confusing of a term as it has several opposing specific meanings in other instances, like the ones Isaac mentioned.

I'd suggest replacing it with another term. My favorite would be a new term like sla_collection or similar for these reasons:

I'd also suggest replacing sc_group as group has strong opposing connotations both to biologists ("group of cells") and in software (zarr, tiledb, hdf5), where it typically means "collection/folder of arrays=datasets". I'd suggest sla_matrix for similar reasons as above. Of course, the precise name doesn't matter so much, it could be a different acronym. But I think it'd be nice if the name encoded the meaning of the data structure.

I'll leave it with this first take on it.

I'd love to hear more in-depth perspectives on naming rationales for the two data structures! And I'd encourage us to still consider the current names sc_group, sc_dataset placeholders until a few more people have commented. It's a very fundamental decision!


For more background, for anyone who still wants to read on: Here comes an effort to define a SLAM[^1]:

And here comes an effort to define "A collection of related SLAMs."[^4]:

[^1]: A SLAM is a meaningful way of structuring & storing data for contingency reasons and typical downstream use cases (queries & learning algorithms): https://github.com/single-cell-data/matrix-api/issues/3#issuecomment-1061719252

[^2]: The data structure is agnostic to that and would represent such information ("empty cells", "doublets", "dead cells", ...) through annotation.

[^3]: Operations on layers of a matrix are particularly simple as dimensions are conserved. Simple aggregations like summing up different layers typically produce meaningful summary statistics (for instance, total counts across spliced and unspliced).

[^4]: Modern biology characterizes different types of systems across different types of readout variables ("dimensions of biology", "features of biological systems"). Systems are sourced & treated with different experimental protocols, and variables are measured with different types of readout technologies. Readout data from the measurement device is then processed across different computational workflows to bring them into matrix format. The detailed semantics of the five-tuple ("system, experiment, readout, technology, comp-workflow") is beyond the scope of the present data structures. However, it is clear that it is not matrix-like. The simplest structure that could map arbitrary complexity within the five-tuple is a flat list, where each item in the list represents an observation of the metadata in the five-tuple. In this case, often, the observational unit is a sample or a batch of homogeneously processed samples. Multi-modal measurement technologies allow sharing much or all of the metadata in ("system, experiment") across SLAMs, which makes it attractive to store a collection of SLAMs together.
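The point about layers conserving dimensions can be illustrated with a small sketch. The layer names and counts below are made up for illustration, and a plain dict of NumPy arrays stands in for whatever layer container an implementation would use.

```python
import numpy as np

# Two layers of the same matrix: identical shape (observations x variables).
# The values are toy counts, not real data.
layers = {
    "spliced": np.array([[3, 0, 1], [2, 5, 0]]),
    "unspliced": np.array([[1, 1, 0], [0, 2, 4]]),
}

# Layers of one annotated matrix must all share the same dimensions.
shapes = {layer.shape for layer in layers.values()}
if len(shapes) != 1:
    raise ValueError(f"layers disagree on shape: {shapes}")

# Because dimensions are conserved, element-wise aggregation is well defined:
# here, total counts across spliced and unspliced.
total_counts = layers["spliced"] + layers["unspliced"]
```

The result has the same shape as each input layer, so it could itself be stored as a further layer.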

joshua-d-campbell commented 2 years ago

This is something I struggled with in the FOM schema as well. Maybe the term "matrix_group" could refer to a set of matrices that are related and "dataset" could be a collection of matrix groups. If the matrices within a sc_group can have different numbers of features, then it might be better to call it an "obs_group". The only other generic possibility that I can think of is "subset" or "modality_subset" since each sc_group is a unique combination of modalities, observations, and features.

ambrosejcarr commented 2 years ago

Thanks very much for the clarity about the challenge, @ivirshup, and for the in-depth decomposition of this question, @falexwolf.

We've discussed and favor the replacement of _dataset with _collection for the reasons you outline, and agree that it would reduce confusion. We also agree that each of these concepts is too generic by itself and requires a prefix.

Our discussions led us to propose two additional requirements on the prefix, since we're going to say it a million times: it should be pronounceable and spellable. slam_ would meet these requirements; sla_ would not.

Our concern with slam is whether "matrix" is unnecessarily limiting. I've thought about it and don't see any rationale for > 2 dimensions (observations, features). We could have made the decision to create a tensor instead of the sc_dataset / sc_group division but because each group is separately processed and filtered, we felt that would introduce strange dependent sub-layers into the object, so did not use that approach. It would also have been harder to transform into toolchain models.

I'm interested to know what level of confidence @falexwolf and @ivirshup have that "matrix" is not overly limiting. If it's not, then slam and slam_collection are reasonable.

We're also interested in ideas for alternative acronyms, and agree we should let this issue germinate a bit before making the change.

joshua-d-campbell commented 2 years ago

Sorry, I didn't refresh to see @falexwolf's last post before adding my own. I also like slam as a potential name. The "simple" does not seem strictly necessary to me (I'm not sure what a complex lam would be at this stage), so it could be shortened to lam to get it to 3 letters like many file formats. It could also be reversed to mal (Matrix with Annotations and Layers).

I don't really have a strong opinion, but I do wonder whether it is worth aligning the name with FOM working group. The FOM (feature observation matrix) acronym was thought up quickly for a grant last summer to get the working group going. I asked during our first call if anyone had other thoughts but didn't get much of a response. slam, fom, and matrix-api may all be different things and should have different names, but I get worried that groups we are trying to work with will get confused if it all sounds too similar, especially if the differences between all of them are subtle.

falexwolf commented 2 years ago

I strongly agree with @joshua-d-campbell's point on aligning FOM vs. matrix-api vs. the data structure name.

To @joshua-d-campbell's point about "simple": I think people might (and I know did) construct matrices by joining, for instance, RNA & ATAC measurements into one contingent table. To me that's then no longer a simple matrix in this sense:

"Simple matrix" means no additional structure on either observations or variables, i.e., observations & variables have the same type & unit.

I'd discourage storing "such a complicated object" over a "simple one" as users would have a hard time understanding it.

I agree with @joshua-d-campbell that we don't need to have "simple" in the name. A specification in the docs that discourages storage of intermediate-stage-processed RNA & ATAC could address the above concern![^1] I think that including or omitting "simple" in the name is more an aesthetic consideration.

To @ambrosejcarr's point of whether the term "matrix" is limiting.

My thinking was as follows: Yes, a "data matrix" is an array of dimension 2 ("tensor of degree 2"), with index-identifications observations ⨉ features.

There may be measurements that generate data in meaningful higher-dimensional structures that could be worth absorbing in a higher-dimensional array.

The only candidate I can presently think of is spatial data. But to take away the conclusion: I think also spatial data should be stored in annotated-matrix form within the scope of the present data structure specification. Below comes why I think so.

I'm not an expert, but as far as I know, there are two common ways of representing spatial information:

  1. Storing it annotated-matrix-like: each observation is annotated with a 2 or 3-dimensional spatial coordinate. There is an arbitrarily high number of "primary/molecular measurement dimensions", e.g., genes from the whole transcriptome, a few selected genes/stains/probes, an arbitrary number of colors, etc. These measurement dimensions are often called channels.
  2. Storing it as a collection of images: many devices - including cameras, medical imaging, high-throughput microscopes, etc. - produce data that is stored as a folder of image files. In regular image files, every pixel comes with measurements in four channels (RGBA). In assays like cell painting and related it may be many more channels, where each channel has a biological meaning according to the stain/probe. After processing a folder of image files, for regular images, to my knowledge, it's common practice to store them as 4- or 5-dimensional arrays/tensors with this layout (observation, x_index, y_index, z_index, channel_index). To arrive there, at least some of resolution adjustment, interpolation, reshaping, and index definition was applied to define the data on regular 2d or 3d grids. Hence, likely, this tensor doesn't contain raw data anymore but (coarse-grained) interpolated values. The big advantage is that the data is now ready for 2d- and 3d convolution. In a flattened layout where (x_index, y_index, z_index) are all adjacent in some form, this wouldn't be the case.

We see things got a little complicated in 2!

I generally wonder whether one shouldn't favor 1 as a first "go-to" stored representation, in particular as this also seems to be what OME now considers (https://github.com/ome/ngff/pull/64). There are many ways of putting measurement data on a grid through reshaping, interpolating, adjusting resolutions, etc. These are highly non-trivial ML-related comp-workflows. Most of them won't fit into the metadata model that we're discussing here.

If people do use imaging data, I'd encourage them to store point measurements that are annotated with spatial information as an intermediate representation after a "folder of files". This would fit the scope of the present repository and be compatible with the "annotated matrix" layout. Reshaped and further-processed tensor/array-like representations of the same data could then be dealt with by ML data infrastructure. I'm pretty sure there should be solutions for this but would have to investigate.
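As a rough illustration of the two representations, the sketch below stores point measurements in annotated-matrix form (representation 1) and then derives a gridded tensor from them as a downstream step (a simplified take on representation 2, with no interpolation). The array names, the tiny 3 × 2 grid, and the toy values are assumptions made for the example only.

```python
import numpy as np

n_obs, n_channels = 5, 3

# Representation 1: annotated matrix.
# X holds one row per observation and one column per channel (toy values).
X = np.arange(n_obs * n_channels, dtype=float).reshape(n_obs, n_channels)
# Each observation is annotated with an integer (x, y) coordinate.
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0]])

# Representation 2 (derived downstream): a dense (x, y, channel) grid.
# Grid positions without a measurement stay NaN, which shows that the grid
# is a processed view rather than the raw point data.
grid = np.full((3, 2, n_channels), np.nan)
grid[coords[:, 0], coords[:, 1], :] = X
```

Real pipelines would typically involve resolution adjustment and interpolation before gridding; the sketch skips those and only shows that the annotated-matrix form is sufficient to rebuild a grid.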

This whole discussion is related to how much the present format should be seen as a "canonical intermediate format" for omics data as opposed to a format that can also absorb all potential downstream representations. Given the reasons above (with really the dominant reason being the ability to build powerful metadata-schema specification around a matrix), I think it should be considered the former: Upstream (fastqs, folders of images, etc.) and downstream (highly-processed tensor-like data) formats that are non-annotated-matrix like should be handled elsewhere.

One should maybe stress this: Restricting ourselves to "annotated matrices" does not mean any information loss. It means that in some cases some nd-array structure that could be absorbed within the substructure of a flat "primary molecular/measurement dimension" would have to be flattened. But in my mind,

  1. such substructure will only occur downstream
  2. such substructure would complicate defining a broader metadata schema specification and hence broaden the scope of the present data structure too much

Sorry, I hope this didn't end up being too convoluted. Please let me know if it is and I'll try to streamline the writeup. 😅

[^1]: I'd consider a joint embedding learned across RNA & ATAC again "simple" as its features would correspond to the homogeneous, unstructured output dimensions of a computational model. The computational model absorbed the complexity of dealing with the structure RNA vs. ATAC in the input and subsequent model architecture.

ivirshup commented 2 years ago

> I've thought about it and don't see any rationale for > 2 dimensions (observations, features).

One case here is genomics data: sgkit uses 3-dimensional arrays (observation, feature, alleles). My understanding here is that it's important for the central data representation to have ploidy information, but that dimension is dropped for most associated arrays.

For spatial data having pixel and point level representations, I agree these have fairly different use cases. If these representations are in scope, a BED-file-like representation for ATAC data would be in scope as well.

I do think it's important that data where point and read level information is associated with an annotated matrix is compatible here. But my current thinking is that the solution there is essentially to keep their obs x var annotated matrices compatible with matrix-api sc_groups and "symlink them" into sc_datasets (or whatever the terms will be)

ambrosejcarr commented 2 years ago

@falexwolf this is a great restatement of CZI and TileDB's goals for this project. I think we should port the bolded sections to the readme.

> This whole discussion is related to how much the present format should be seen as a "canonical intermediate format" for omics data as opposed to a format that can also absorb all potential downstream representations. Given the reasons above (with really the dominant reason being the ability to build powerful metadata-schema specification around a matrix), I think it should be considered the former: Upstream (fastqs, folders of images, etc.) and downstream (highly-processed tensor-like data) formats that are non-annotated-matrix like should be handled elsewhere.
>
> One should maybe stress this: Restricting ourselves to "annotated matrices" does not mean any information loss. It means that in some cases some nd-array structure that could be absorbed within the substructure of a flat "primary molecular/measurement dimension" would have to be flattened.

@ivirshup The sgkit example is interesting, and a good test case.

... I think we could support these data. You could imagine flattening "allele" and "variant" dimensions to "variant" and annotate each variant with the allele it associates with to ensure no information loss.
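A small sketch of that flattening, assuming a dense NumPy array and a plain dict of columns standing in for the variable annotations; the dimension sizes and column names (`variant`, `allele`) are illustrative only and are not taken from sgkit.

```python
import numpy as np

n_obs, n_variants, n_alleles = 2, 3, 2

# Toy 3D array: (observation, variant, allele).
calls = np.arange(n_obs * n_variants * n_alleles).reshape(
    n_obs, n_variants, n_alleles
)

# Flatten the last two dimensions into one variable axis.
X = calls.reshape(n_obs, n_variants * n_alleles)

# Annotate every flattened variable with the variant and allele it came from.
var = {
    "variant": np.repeat(np.arange(n_variants), n_alleles),
    "allele": np.tile(np.arange(n_alleles), n_variants),
}

# The annotations are enough to rebuild the original 3D array,
# so nothing was lost by flattening.
rebuilt = np.empty_like(calls)
rebuilt[:, var["variant"], var["allele"]] = X
```

Because the rebuild relies only on the two annotation columns, it would also work if the flattened variables were stored in a different order.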

This is a very useful alignment discussion. I also think we have our answer: "Matrix" is not overly limiting. Based on this, I think the current proposal is as follows (But please let me know if I've interpreted your comments incorrectly!)

| Data structure | Proposed name |
| --- | --- |
| A simple layered annotated matrix (SLAM) storing measurements of variables. (AnnData) | slam |
| The overall container structure, a collection of related SLAMs. (MuData) | slam_collection |

I also agree with @joshua-d-campbell and @falexwolf's proposal to align FOM vs. matrix-api vs. the data structure name.

Once we've aligned on a name, I favor changing both the working group and the specification proposal here.

falexwolf commented 2 years ago

I'm happy we seem so aligned, and good with proceeding like this!

Re @ivirshup's example: Thanks for pointing it out! I agree with Ambrose that one can flatten it without information loss, and also without loss of computational efficiency. Per-variant queries & aggregations are less convenient if there is "no per-variant dimension". Adding that convenience back could be achieved through a "genomics"-accessor that talks to the .var annotations, and presents the data to the user with a variant and allele slot.

I also think that that should be possible without efficiency losses as the 3d (observation, variant, allele) array should look exactly the same in storage as the 2d (observation, variant_allele) array.
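For a dense, row-major, in-memory array this claim can be checked directly with NumPy, as sketched below. The helper `genomics_view` is a hypothetical stand-in for the accessor idea and is not an existing API; chunked or sparse on-disk formats are not covered by this sketch.

```python
import numpy as np

n_obs, n_variants, n_alleles = 4, 3, 2

# Toy 3D array: (observation, variant, allele), C-contiguous.
array_3d = np.arange(n_obs * n_variants * n_alleles, dtype=np.int8).reshape(
    n_obs, n_variants, n_alleles
)
# The flattened 2D form: (observation, variant_allele).
array_2d = array_3d.reshape(n_obs, n_variants * n_alleles)

# Both arrays are views on the same buffer, with identical byte layout.
same_buffer = np.shares_memory(array_3d, array_2d)
same_bytes = array_3d.tobytes() == array_2d.tobytes()


def genomics_view(X, n_alleles):
    """Hypothetical accessor: present a flattened matrix with an allele axis."""
    n_obs, n_flat = X.shape
    return X.reshape(n_obs, n_flat // n_alleles, n_alleles)


# Per-variant aggregation through the accessor, e.g. summing over alleles.
per_variant_sum = genomics_view(array_2d, n_alleles).sum(axis=2)
```

The accessor here assumes a fixed number of alleles per variant; variable cardinality would need the per-variable annotations instead of a plain reshape.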

ivirshup commented 2 years ago

My impression was they like the 3d representation. Someone from sgkit could probably give more context on this. Maybe @hammer or @tomwhite could answer:

Would referring to your data structure as a "matrix" be fine? For context on this thread, we're trying to figure out a shared name for both a single AnnData/SummarizedExperiment/MatrixTable and a collection of them. IIRC sgkit doesn't name the singular structure, it just has a set of conventions around an xarray dataset.


From what I know:

> I also think that that should be possible without efficiency losses

I'm not sure this is the case. It seems that most of the annotation elements are aligned to either the observation or variant dimensions. I think the semantics stop working nicely when you take the product of variant and allele.

IIRC Hail also does not do this (btw, Hail calls their AnnData-like object a MatrixTable)

> the 3d (observation, variant, allele) array should look exactly the same in storage as the 2d

This would depend on the cardinality of the array, right?


I generally don't think it would be that bad to allow X a third dimension. For the right cardinality, you could just think of it as a different dtype that happens to be composed of multiple values.

To me, the constraint on X is that the first dimension is aligned to the observations, and the second is aligned to the variables. Not sure we need to constrain further dimensions.

We would just be limiting how much you can annotate those extra dimensions.
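Both ideas in this comment can be sketched with NumPy: a check that constrains only the first two dimensions of X, and a structured dtype that packs the extra values into each matrix element. The function name `check_x`, the field name `alleles`, and the shapes are hypothetical and used for illustration only.

```python
import numpy as np


def check_x(X, n_obs, n_var):
    """Require only that the first two dimensions align to obs and var.

    Returns the shape of any extra, unconstrained trailing dimensions.
    """
    if X.ndim < 2:
        raise ValueError("X needs at least two dimensions")
    if X.shape[:2] != (n_obs, n_var):
        raise ValueError(f"X has shape {X.shape}, expected ({n_obs}, {n_var}, ...)")
    return X.shape[2:]


n_obs, n_var = 4, 3
matrix = np.zeros((n_obs, n_var))  # plain 2D case
genotypes = np.zeros((n_obs, n_var, 2), dtype=np.int8)  # extra third dimension

extra_2d = check_x(matrix, n_obs, n_var)
extra_3d = check_x(genotypes, n_obs, n_var)

# Alternative view: keep X two-dimensional and let each element hold
# several values through a structured dtype with a sub-array field.
genotype_dtype = np.dtype([("alleles", np.int8, (2,))])
packed = np.zeros((n_obs, n_var), dtype=genotype_dtype)
packed["alleles"] = genotypes
```

In the packed form the array itself stays obs × var, while the extra values are reachable through the field, which matches the "different dtype" framing but leaves those extra values without annotations of their own.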

tomwhite commented 2 years ago

> My impression was they like the 3d representation. Someone from sgkit could probably give more context on this. Maybe @hammer or @tomwhite could answer:

One of the nice things about xarray and zarr is that the number of dimensions is flexible.

> IIRC sgkit doesn't name the singular structure, it just has a set of conventions around an xarray dataset.

That's right, so we tend to use xarray nomenclature. So the whole data structure is an xarray dataset, which xarray defines as "a dict-like container of labeled arrays (DataArray objects) with aligned dimensions."

> Would referring to your data structure as a "matrix" be fine? For context on this thread, we're trying to figure out a shared name for both a single AnnData/SummarizedExperiment/MatrixTable and a collection of them.

As mentioned above the individual items in an xarray dataset are called (data) arrays, but sometimes they might be referred to as matrices.

falexwolf commented 2 years ago

Thanks, @tomwhite!

I also agree with @ivirshup on these three paragraphs:

> I generally don't think it would be that bad to allow X a third dimension. For the right cardinality, you could just think of it as a different dtype that happens to be composed of multiple values.
>
> To me, the constraint on X is that the first dimension is aligned to the observations, and the second is aligned to the variables. Not sure we need to constrain further dimensions.
>
> We would just be limiting how much you can annotate those extra dimensions.

This would mean we would continue to build metadata standards that assume the second dimension corresponds to variables. This will imply that, in order to benefit fully from these standards, data will need to be reshaped into that form.

However, we can also allow further dimensions and postpone a solution for how to treat them in the metadata schema. Probably, this means delaying it for a few years. I think coming up with it will be a substantial challenge.

Regarding the consequences for naming: if the term "matrix" feels too constraining, one could change slam to slaa or slat for either "simple layered annotated array" or "... tensor".

I'd personally feel more comfortable with the narrow "matrix use case" in which users will always exactly get what they expect, and not some surprising 3rd dimension that they'll not know how to query and generally use. I appreciate that a converter for the sgkit needs to be written, then, but I also think these converters will need to be written anyway.

If we achieve an intuitive, well-designed canonical matrix-api for many types of biological data, this is both achievable and will bring much value. I think that the downside that this wouldn't be readily applicable to all types of biological data shouldn't let us make the mistake of broadening the scope so much that we can't precisely formulate it anymore.

Hence, I'd suggest sticking with slam. If anyone feels passionate in the other direction, however, I won't oppose it! ☺️

ambrosejcarr commented 2 years ago

The working proposal for this is slam and slam_collection. @stavrospapadopoulos and I are going to source some wider feedback on it with the intent to get it finalized in the next few weeks. @ivirshup @falexwolf if there's anyone else you'd like to bring in to provide feedback please do so.

@joshmoore I'm interested if you have a perspective as an upstream format maintainer.

joshmoore commented 2 years ago

Starting off with the general group/dataset discussion, I'd just like to :+1: @ivirshup's "it's confusing":

| | Higher-level / folder-like | Lower-level / file-like |
| --- | --- | --- |
| zarr-python | Group | Array (but sometimes Dataset) |
| h5py | Group | Dataset |
| NetCDF | Group | Variable |
| Xarray | Dataset | DataArray (internally Variable) |
| OME | Dataset | Image |
| ... | ... | ... |

re: slam -- I like it. I've long searched for something better than HAT to cover both HDF5s & Zarrs, or "hierarchies of annotated tensors". I guess if you ever need the recursiveness, you could go have a slam_tree (or slat? or slamr?!).

re: 2D -- if you decide it is a MUST, "tables" and "dataframes" express it better to me, though this probably is as contentious as "dataset" and "group".

re: >2D (outside of my wheelhouse but...) "To me, the constraint on X is that the first dimension is aligned to the observations, and the second is aligned to the variables. Not sure we need to constrain further dimensions." (https://github.com/single-cell-data/matrix-api/issues/11#issuecomment-1069152970) is intriguing. I assume xarray-ness would suffice to make these additional, higher-dimension annotations discoverable.

re: OME's table representation (https://github.com/single-cell-data/matrix-api/issues/11#issuecomment-1066529607) -- here we are very much looking to define how to bridge with your work here, so if you decide to do something else, we would likely follow suit. cc @kevinyamauchi

ambrosejcarr commented 2 years ago

Our poll of the FOM group and other stakeholders completed today. The finalist names were FOLD (Feature Observation Layered Data), SLAM (Simple Layered Annotated Matrices), and SOMA (Stack of Matrices, Annotated).

Voting indicated that SOMA was the preferred choice, and we had agreed that voting would be our decision-making approach for this question. The following changes will result from this decision:

| Existing name in code | New name in code | Long form |
| --- | --- | --- |
| sc_group | soma | Stack of Matrices, Annotated (SOMA) |
| sc_collection | soma_collection | SOMA collection |

Follow-up work: https://github.com/single-cell-data/matrix-api/issues/27

johnkerl commented 1 year ago

Circling back -- this seminal issue was in fact the foundational material behind all of last year's SOMA spec design:

https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md