single-cell-data / SOMA

A flexible and extensible API for annotated 2D matrix data stored in multiple underlying formats.
MIT License

Alignment with FOM schema #21

Closed joshua-d-campbell closed 1 year ago

joshua-d-campbell commented 2 years ago

I just wanted to follow up on a conversation from #11 and discuss integration of the FOM schema with this effort, as there are probably both synergies and overlaps in scope. I am thinking of renaming FOM to something like “Matrix and Analysis Metadata Standards (MAMS)” because the primary purpose (at least for the first half of the working group) was to develop metadata fields (i.e. a database-like schema) that can be used to describe the data contained within matrices, how they relate to each other, and the provenance of the tools/functions that created them. This includes fields for the data matrices (X or FOM), annotation matrices (var/obs or FAM/OAM), ID arrays/matrices (OID/FID), and potentially graph data.

To probably oversimplify things a bit, my initial impression is that the matrix-api schema uses a combination of “directory-like” structures and ID labels (e.g. “RNA.raw”) to encode what each matrix is and how the matrices within a dataset relate to each other. Using hierarchical directory-like structures to define groups is intuitive and similar to the way toolchain objects arrange their data, so it can make sense to do that here. As an aside, minor limitations may be that a matrix can only “live” in one group and that there may be multiple ways to define the hierarchy. A database-like schema with metadata tags can be a bit less intuitive but potentially more flexible (e.g. it can define many-to-many relationships) and more extensible to new situations (the hierarchy doesn’t have to be redefined for new scenarios). I’m sure that there are other upsides/downsides to each, but they are probably not mutually exclusive either.
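A minimal sketch of the contrast described above, in plain Python. Everything here is hypothetical: the matrix IDs, the `groups` tag, and the group names are made up for illustration and are not taken from matrix-api or the FOM schema.

```python
# Hypothetical illustration only; no name here comes from matrix-api or FOM.

# Directory-like layout: each matrix lives under exactly one group.
hierarchy = {
    "RNA": {"raw": "matrix_1", "normalized": "matrix_2"},
    "ADT": {"raw": "matrix_3"},
}

# Tag-based layout: flat records, where one matrix may belong to several groups.
records = [
    {"id": "matrix_1", "analyte": "RNA", "processing": "raw",
     "groups": ["all_cells"]},
    {"id": "matrix_2", "analyte": "RNA", "processing": "normalized",
     "groups": ["all_cells", "clustering_input"]},
    {"id": "matrix_3", "analyte": "ADT", "processing": "raw",
     "groups": ["all_cells", "clustering_input"]},
]


def members(group):
    """Return the IDs of all matrices tagged with the given group."""
    return [r["id"] for r in records if group in r["groups"]]


print(hierarchy["RNA"]["raw"])       # lookup by position in the hierarchy
print(members("clustering_input"))   # lookup by tag, crossing analytes
```

In the tag-based form, adding a new grouping only means adding a tag value; in the directory-like form it would mean moving or duplicating a matrix.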

There are some simple potential synergies. Some of the ID labels that are being proposed in the matrix-api are directly related to FOM fields. RNA.raw is just a combination of two FOM fields, “analyte” and “processing”, and could be referenced in the matrix-api docs with something like analyte.processing. Referring to these fields and encouraging (but not forcing) people to use the suggested ontologies for them will ultimately help integration across datasets and probably make the api more useful. The one place where there is a potential discrepancy with the current description is RNA.filtering. FOM currently has separate fields describing the level of filtering that has been done to the observations or features (obs_subset, feature_subset), and this is distinct from the processing field, which describes what type of data is in the matrix. These fields might be useful for naming different sc_groups. There have been some grouping-like metadata tags drafted in the FOM schema. I’m debating removing these and leaving that to implementations like matrix-api.
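As a rough sketch of how an analyte.processing label could be derived from the FOM fields while filtering stays in its own fields: the field names `analyte`, `processing`, `obs_subset`, and `feature_subset` come from the discussion above, but the `MatrixTags` class, the `label` method, and the values `"full"` and `"filtered"` are assumptions for illustration, not part of either schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatrixTags:
    """Hypothetical container for a few FOM-style fields."""
    analyte: str
    processing: str
    obs_subset: str = "full"       # assumed placeholder value
    feature_subset: str = "full"   # assumed placeholder value

    def label(self) -> str:
        # The label uses only analyte and processing; filtering is tracked
        # separately in obs_subset / feature_subset.
        return f"{self.analyte}.{self.processing}"


raw = MatrixTags("RNA", "raw")
filtered = MatrixTags("RNA", "raw", obs_subset="filtered")

print(raw.label(), filtered.label())  # same label for both
print(raw == filtered)                # still distinguishable by obs_subset
```

Under these assumptions, two matrices can share the label RNA.raw and still differ in how their observations were filtered, which is the distinction an RNA.filtering label would blur.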

I am hoping that the metadata fields defined in FOM can be stored in the various objects (maybe in __meta). This may also help with cross-dataset queries. For example, getting the rna.clean.counts matrix from datasets A, B, and C to start integrative analyses will likely be easier if tags are standardized and stored in a consistent way.
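A small sketch of what such a cross-dataset query might look like if every matrix carried its tags under a `__meta` key. The dataset layout, the matrix names, the `find_matrices` helper, and the mapping of rna.clean.counts onto `analyte`, `obs_subset`, and `processing` are all assumptions made for this example.

```python
# Hypothetical datasets: matrix name -> {"__meta": tags}. Layout is assumed.
datasets = {
    "A": {
        "X": {"__meta": {"analyte": "rna", "obs_subset": "clean", "processing": "counts"}},
        "X_raw": {"__meta": {"analyte": "rna", "obs_subset": "full", "processing": "counts"}},
    },
    "B": {
        "counts": {"__meta": {"analyte": "rna", "obs_subset": "clean", "processing": "counts"}},
    },
    "C": {
        "adt": {"__meta": {"analyte": "adt", "obs_subset": "clean", "processing": "counts"}},
    },
}


def find_matrices(datasets, **wanted):
    """Return (dataset, matrix) pairs whose tags match every requested field."""
    hits = []
    for ds_name, matrices in datasets.items():
        for mat_name, mat in matrices.items():
            tags = mat["__meta"]
            if all(tags.get(key) == value for key, value in wanted.items()):
                hits.append((ds_name, mat_name))
    return hits


print(find_matrices(datasets, analyte="rna", obs_subset="clean", processing="counts"))
```

The query works even though each dataset names its matrices differently, because it matches on the standardized tags and not on the names.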

Sorry for the long post, but it would be great to get others’ thoughts at some point.

falexwolf commented 2 years ago

I don't entirely understand everything in the previous post as I lack knowledge of the FOM schema, but I agree that it's a great idea to write down examples of the interplay between matrix-api and the FOM schema.

Maybe one could start documenting some examples in this issue, and then point out potential improvements? Right now, the only major question seems to be how to represent filtering in a way that plays well with the current matrix-api specification and the FOM schema.

As filtering is a very important basic question, I'd love us to resolve it either by adapting matrix-api or the FOM schema.

Btw, I'd be happy with the MAMS abbreviation and the corresponding long form! I think it's much easier to understand than "FOM".

ambrosejcarr commented 1 year ago

Closing as outdated, but happy to revisit if there continues to be interest.