microbiomedata / issues

public repo for issues related to NMDC work

Draft: Where should reference vs. usage data be stored? #630

Open turbomam opened 8 months ago

turbomam commented 8 months ago

This morning I am feeling very inspired by some work that @anastasiyaprymolenna, @sierra-moxon and others were doing last night: reviewing a PR I made for simplifying ChemicalConversionProcess, and especially identifying where reference knowledge about chemical substances should live.

For an Instrument indexed by its serial number, reference knowledge might be the vendor name, the model number, and/or the smallest and largest entities it can detect in a given configuration. For a Substance, the reference knowledge would include the SMILES and InChI representations, molecular weight, etc. Those resources can be used in different ways, but some things hold constant for them no matter how they are used.

This issue is currently a draft. I will be adding clarification and examples. That doesn't mean you can't comment on it now or ask questions.

An initial assumption: we will not allow totally unconstrained specifications of Instruments, Substances, organizations, etc. That means the primary key for those instances won't be a string slot, or even an unconstrained CURIE slot.
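
As a minimal sketch of what "constrained" could mean in practice, the check below accepts an identifier only if it is a CURIE whose prefix is on an allow-list. The prefix set, the regular expression, and the function name are all assumptions made for illustration; none of them come from nmdc-schema.

```python
import re

# Hypothetical allow-list of prefixes; the real set would be defined by the schema.
ALLOWED_PREFIXES = {"nmdco", "CHEBI"}

# A prefix, a colon, then a local part limited to letters, digits and underscores.
CURIE_PATTERN = re.compile(r"(?P<prefix>[A-Za-z][A-Za-z0-9_]*):(?P<local>[A-Za-z0-9_]+)")


def is_constrained_curie(value: str) -> bool:
    """Return True only for CURIEs whose prefix is on the allow-list."""
    match = CURIE_PATTERN.fullmatch(value)
    return match is not None and match.group("prefix") in ALLOWED_PREFIXES
```

Under these assumptions a free-text value like `de-ionized water` is rejected, and so is a well-formed CURIE with an unknown prefix.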

Some options for storing reference knowledge:

  1. Put all knowledge about a thing in in-lined instances (like Substances) that are embedded in some other instances (like ChemicalConversionProcesses) which have an identifier and can be found in a MongoDB collection. This is the most verbose and free-form solution. It will almost certainly lead to duplicate and/or contradictory information.
  2. Put all reference knowledge about something into annotations on permissible values in an enumeration, then mention that permissible value in classes with an identifier. That class could also reflect the way that some thing (as specified by a permissible value) was used, like the concentration, volume or mass of a substance. This solution requires searches to traverse both the data in MongoDB and the enumeration(s) in the schema.
  3. Save the reference knowledge as instances of some class in MongoDB. It would be a new collection like substance_set. berkeley-schema-fy24 is already moving in that direction for Instruments. In this case, searches would only need to traverse MongoDB.
  4. Save the reference knowledge in the nmdc-ontology. One justification for this is that there doesn't appear to be any term, in any of the ontologies or databases that we prioritize, that uniquely identifies de-ionized water. We could create a term in nmdco, at least as a stopgap, and then shop it around to other ontologies we work with.
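
To make the reference/usage split concrete, here is a rough sketch in the spirit of option 3. A plain dict stands in for a `substance_set` collection, and the class names, field names, and the `nmdco:0000001` identifier are invented for illustration rather than taken from the schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubstanceReference:
    """Facts that hold no matter how the substance is used."""
    id: str
    name: str
    molecular_weight: float  # g/mol


@dataclass(frozen=True)
class SubstanceUsage:
    """How a substance was used in one process; points at reference data by id."""
    substance_id: str
    volume_ml: float


# Stand-in for a MongoDB collection keyed by identifier (the id is hypothetical).
substance_set = {
    "nmdco:0000001": SubstanceReference("nmdco:0000001", "de-ionized water", 18.015),
}


def resolve(usage: SubstanceUsage, references: dict) -> SubstanceReference:
    """Look up the reference record for a usage; raises KeyError for unknown ids."""
    return references[usage.substance_id]


usage = SubstanceUsage(substance_id="nmdco:0000001", volume_ml=50.0)
reference = resolve(usage, substance_set)
```

The usage record carries only the identifier and the quantities specific to that use, so the molecular weight is stated once and cannot drift between records the way it could with in-lined instances.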

The crux issue:

We need to support searches that include general/common/trivial terms like "find LC/MS proteomics data where the proteins were digested in the presence of the detergent tween". We know that tween comes in several formulations with different characteristics (and maybe shouldn't even be put anywhere near an LC/MS system!). So we may be tempted to say that we must have multiple entities in our system (using one of the storage solutions above) that all use tween as the primary key because that's what the user will query for.

Of course you can't have multiple identical primary keys in a data storage system, and that may be seen as justification to bundle the reference data (like InChI specifications) in the in-lined instances, option 1.

I am taking the position that reference knowledge should be separated from usage instances, and that the reference knowledge records must use distinct, unambiguous identifiers. If several of these entities share trivial names, then those should be aliases. At this point it becomes the responsibility of the user interface (a website, API, etc.) to support searches that account for ambiguity. If the reference knowledge is stored intelligently, then the development of search tools shouldn't be too difficult.
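
One way a search layer could account for that ambiguity is an alias index that maps each name to every identifier that carries it. This is only a sketch: the record layout, the identifiers, and the function names are hypothetical, and the two Tween formulations are just examples.

```python
from collections import defaultdict

# Hypothetical reference records; each has a unique id and may share aliases.
REFERENCE_RECORDS = [
    {"id": "nmdco:0000101", "name": "Tween 20", "aliases": ["tween", "polysorbate 20"]},
    {"id": "nmdco:0000102", "name": "Tween 80", "aliases": ["tween", "polysorbate 80"]},
]


def build_alias_index(records):
    """Map every case-folded name or alias to the set of ids that use it."""
    index = defaultdict(set)
    for record in records:
        for label in [record["name"], *record["aliases"]]:
            index[label.casefold()].add(record["id"])
    return index


def find_candidates(term, index):
    """Return all ids matching a search term, sorted; empty list if none match."""
    return sorted(index.get(term.strip().casefold(), set()))


ALIAS_INDEX = build_alias_index(REFERENCE_RECORDS)
candidates = find_candidates("Tween", ALIAS_INDEX)
```

Here a search for "Tween" yields both identifiers, and it is then up to the interface to ask the user which one they mean or to search across all of them, while a search for "polysorbate 80" resolves to a single record.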

Ideally, the unique identifiers would be brief and use a small alphabet, like letters, numbers and minimal punctuation. That means that InChI strings or keys wouldn't be good identifiers. CAS has the most identifiers of any chemical database or registry, but it has a limited access policy. I am leaning strongly towards specifying reference knowledge about Substances in the nmdc-ontology. We would prioritize importing entities from established semantic resources, but create terms when that is really needed. We could annotate those terms, even when they are imported, with whatever additional names and identifiers we want. And users can access the nmdc-ontology via BioPortal or by downloading it from GitHub.


Note that I am not addressing "workflow data" here, like FASTQ inputs or GFF outputs. I would like to see or create a document that explains where these live (mostly as files on NERSC?) and how to tie them into schema-compliant records in MongoDB (DataObject URLs?).

anastasiyaprymolenna commented 8 months ago

Related to ChemicalConversionProcess refactoring https://github.com/microbiomedata/nmdc-schema/issues/1842

turbomam commented 8 months ago

Related to ChemicalConversionProcess refactoring microbiomedata/nmdc-schema#1842

Yes, among other modeling patterns that involve reference and usage information.

mslarae13 commented 7 months ago

@turbomam This is related to ChemicalConversionProcess. It seems like the remainder of this issue is a "thought experiment".

I think the requirement for Metabolink2 is done. Will remove from squad board.

mslarae13 commented 6 months ago

Won't be resolved as part of the metabolink 2 squad. Moving this out of the task.