Open · turbomam opened this issue 7 months ago
Related to ChemicalConversionProcess refactoring https://github.com/microbiomedata/nmdc-schema/issues/1842
Yes, among other modeling patterns that involve reference and usage information
@turbomam This is related to ChemicalConversionProcess. It seems like the remainder of this issue is a "thought experiment".
I think the requirement for Metabolink2 is done. Will remove from squad board
Won't be resolved as part of the metabolink 2 squad. Moving this out of the task.
This morning I am feeling very inspired by some work that @anastasiyaprymolenna, @sierra-moxon, and others were doing last night. I'm talking about reviewing a PR I made for simplifying `ChemicalConversionProcess`, especially identifying where reference knowledge about chemical substances should live.
For an `Instrument` indexed by its serial number, reference knowledge might be the vendor name, the model number, and/or the smallest and largest entities it can detect in a given configuration. For a `Substance`, the reference knowledge would include the SMILES and InChI representations, molecular weight, etc. Those resources can be used in different ways, but some things hold constant for them no matter how they are used.

This issue is currently a draft. I will be adding clarification and examples. That doesn't mean you can't comment on it now or ask questions.
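To make the reference/usage split above concrete, here is a minimal sketch (field and record names are illustrative, not actual nmdc-schema slots): the `Instrument` record keyed by its serial number carries only facts that hold no matter how it is used, while usage-specific details stay on the record describing one particular use.

```python
# Hypothetical sketch: reference knowledge vs. usage knowledge for an Instrument.
# Field names are illustrative, not actual nmdc-schema slots.

instrument_reference = {
    "id": "SN-0001234",            # primary key: the serial number
    "vendor": "ExampleVendor",     # constant regardless of how the instrument is used
    "model": "X-200",
    "smallest_detectable_mz": 50,  # detection range in a given configuration
    "largest_detectable_mz": 2000,
}

# Usage knowledge lives on the record that describes one particular use,
# and points back at the instrument by its primary key instead of copying it.
data_generation_usage = {
    "id": "nmdc:dgns-00-example",
    "instrument_used": "SN-0001234",
    "acquisition_date": "2024-05-01",
    "operator": "jdoe",
}
```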
An initial assumption: we will not allow totally unconstrained specifications of `Instrument`s, `Substance`s, organizations, etc. That means the primary key for those instances won't be a string slot, or even an unconstrained CURIE slot.

Some options for storing reference knowledge (both are sketched after this list):

1. In-lined instances (of `Substance`s) that are embedded in some other instances (like `ChemicalConversionProcess`es) which have an identifier and can be found in a MongoDB collection. This is the most verbose and free-form solution. It will almost certainly lead to duplicate and/or contradictory information.
2. Instances with their own identifiers, stored in a dedicated MongoDB collection like `substance_set`. berkeley-schema-fy24 is already moving in that direction for `Instrument`s. In this case, searches would only need to traverse MongoDB.
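Here is a rough contrast of the two options, assuming hypothetical slot names (`has_input`, `smiles`, etc.) and using water only because its representations are well known: option 1 copies the reference facts into every process that uses the substance, while option 2 stores them once and refers to them by identifier.

```python
# Option 1: an identifier-free Substance in-lined in each ChemicalConversionProcess.
# The same SMILES/InChI facts get copied (and can drift) everywhere the substance is used.
process_with_inlined_substance = {
    "id": "nmdc:chcpr-00-example",
    "has_input": [
        {
            "name": "water",
            "smiles": "O",
            "inchi": "InChI=1S/H2O/h1H2",
            "molecular_weight": 18.015,
        }
    ],
}

# Option 2: the reference knowledge lives once in a dedicated collection
# (e.g. substance_set), and processes point at it by primary key.
substance_record = {
    "id": "CHEBI:15377",  # water
    "name": "water",
    "smiles": "O",
    "inchi": "InChI=1S/H2O/h1H2",
    "molecular_weight": 18.015,
}

process_with_reference = {
    "id": "nmdc:chcpr-00-example",
    "has_input": ["CHEBI:15377"],  # reference by identifier only
}
```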
The crux issue:

We need to support searches that include general/common/trivial terms like "find LC/MS proteomics data where the proteins were digested in the presence of the detergent tween". We know that tween comes in several formulations with different characteristics (and maybe shouldn't even be put anywhere near an LC/MS system!). So we may be tempted to say that we must have multiple entities in our system (using one of the storage solutions above) that all use tween as the primary key, because that's what the user will query for.
Of course you can't have multiple identical primary keys in a data storage system, and that may be seen as justification to bundle the reference data (like InChI specifications) in the in-lined instances, option 1.
I am taking the position that reference knowledge should be separated from usage instances, and that the reference knowledge records must use distinct, unambiguous identifiers. If several of these entities share trivial names, then those should be aliases. At this point it becomes the responsibility of the user interface (a website, API, etc.) to support searches that account for ambiguity. If the reference knowledge is stored intelligently, then the development of search tools shouldn't be too difficult.
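As a sketch of what alias-aware search could look like on the application side, assuming option 2 above (a `substance_set` collection whose records carry an `aliases` list), and with database, collection, and slot names that are assumptions rather than existing nmdc-runtime code:

```python
from pymongo import MongoClient

client = MongoClient()  # connection details omitted
db = client["nmdc"]     # hypothetical database name


def substances_matching(term: str) -> list[dict]:
    """Resolve a trivial name like 'tween' to every reference record that claims it
    as a primary name or alias; the UI/API decides how to disambiguate the matches."""
    return list(db["substance_set"].find({"$or": [{"name": term}, {"aliases": term}]}))


def processes_using(term: str):
    """Find processes whose inputs include any substance matching the trivial term."""
    ids = [s["id"] for s in substances_matching(term)]
    return db["chemical_conversion_process_set"].find({"has_input": {"$in": ids}})


# "tween" might resolve to several distinct records (Tween 20, Tween 80, ...),
# each with its own unambiguous identifier but sharing the alias.
```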
Ideally, the unique identifiers would be brief and use a small alphabet, like letters, numbers, and minimal punctuation. That means that InChI strings or keys wouldn't be good identifiers. CAS has the most identifiers of any chemical database or registry, but it has a limited access policy. I am leaning strongly towards specifying reference knowledge about `Substance`s in the nmdc-ontology. We would prioritize importing entities from established semantic resources, but create terms when that is really needed. We could annotate those terms, even when they are imported, with whatever additional names and identifiers we want. And users can access the nmdc-ontology via BioPortal or by downloading it from GitHub.

Note that I am not addressing "workflow data" here, like FASTQ inputs or GFF outputs. I would like to see or create a document that explains where these live (mostly as files on NERSC?) and how to tie them into schema-compliant records in MongoDB (DataObject URLs?)
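To illustrate the identifier style argued for above (brief, small alphabet) together with the ontology-based approach, here is a guess at how an imported term might be annotated; this is not an existing nmdc-ontology layout, just a sketch using water because its cross-references are well known.

```python
# Hypothetical rendering of an imported ontology term with local annotations.
# The CURIE stays short and query-friendly; long representations like InChI and
# access-restricted identifiers like CAS numbers ride along as annotations.
substance_term = {
    "id": "CHEBI:15377",                     # brief primary key (water)
    "label": "water",
    "aliases": ["H2O", "dihydrogen oxide"],  # trivial names users may search for
    "annotations": {
        "inchi": "InChI=1S/H2O/h1H2",
        "smiles": "O",
        "cas_number": "7732-18-5",
    },
}
```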