Open aclum opened 2 weeks ago
Interesting! So, currently the proteomics mass spec data is one-to-one for omprc to mass spec data object, and there's migrations planned but I'm unsure what those migrations will end up looking like. In the above paradigm we will make one new DataGeneration class entry and associate a "bunch" of MassSpectrometry instances to that, where the MassSpectrometry carries with it the 1) has_input (Biosample or processed sample), 2) has_output (DataObject with the related raw file stuff) and 3) the part_of which will reference the DataGeneration entry, yah? I've not looked at the migrator to see if that's how it will work, nor if we have the biosample or processed sample IDs to match up to the has_input.
Anyway, @pdpiehowski this structure would be highly useful for grouping mass spec proteomics data (Workflow outputs) into a single "experiment" (loosely defined here as a collection of sample-derived data that should be grouped to allow biological interpretations to occur) for follow-on aggregated analyses. It would also make a natural hurdle against grouping MassSpectrometry entries that shouldn't be grouped (different instruments, different collection times, different institutions, etc.), where care must be taken when doing so.
@aclum, thank you for tagging me. This modeling is really important for lipidomics and metabolomics analyses (for similar reasons to what @SamuelPurvine mentions) and I didn't know about it! Our current lipid analysis is 1:1, but I have a batch analysis pipeline in the works that would rely on this part_of connection that should (in theory) result in better annotation coverage, so I'll be sure to populate that slot when the time comes.
I don't actually understand the issue itself, but if you think I can be of use, please forward the Monday meeting you speak of to me and I'll try to contribute where I can.
Great, I can't say that I fully understand either, but it sounds like we can save some major headaches down the road.
monday meeting is the rollout lead meeting so hopefully you'll be there @kheal
@aclum
Ok, I think I understand this more. I don't think your third entry should be modeled as a DataGeneration (which, by definition, is "A DataGeneration in which the sequence of DNA or RNA molecules is generated"). Instead, this represents a group of DataGeneration instances, correct? It is impossible to tell that from the record itself and I find that problematic.
Could we use the part_of so the daughters reference each other, eliminating the need for a third, "shadow" instance? That would eliminate the logic gymnastics for the data portal, and each DataGeneration instance would actually represent an instance of generating data. I do not love the name of the slot and I'd be open to changing it, but that is how I would use this slot for MassSpectrometry instances - the concept of a parent/children doesn't translate to MassSpectrometry the same way you have modeled and have explained. Depending on our desired plotting on the upset plots, we can collapse these by looking for unique has_input:analyte_category, which would work for lipids too (which will have 2 DataGeneration instances for each Biosample, but they are not combined like the sequencing data are).
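To make the collapsing idea concrete, here is a minimal sketch of deduplicating records by unique has_input plus analyte_category. It assumes records are plain dicts shaped like the data_generation_set entries in this thread; the function name, the record IDs, and the "lipidome" category value are hypothetical and only for illustration.

```python
def unique_sample_categories(records):
    """Collapse DataGeneration records to unique (has_input, analyte_category) pairs.

    Assumption: each record is a dict with an optional "has_input" list and an
    optional "analyte_category" string, as in the examples in this thread.
    """
    pairs = set()
    for rec in records:
        category = rec.get("analyte_category")
        for sample_id in rec.get("has_input", []):
            pairs.add((sample_id, category))
    return pairs


# Hypothetical lipid case: two MassSpectrometry runs for the same Biosample.
lipid_runs = [
    {
        "id": "nmdc:dgms-00-aaa111",  # made-up ID
        "type": "nmdc:MassSpectrometry",
        "analyte_category": "lipidome",
        "has_input": ["nmdc:bsm-00-red"],
    },
    {
        "id": "nmdc:dgms-00-aaa222",  # made-up ID
        "type": "nmdc:MassSpectrometry",
        "analyte_category": "lipidome",
        "has_input": ["nmdc:bsm-00-red"],
    },
]

collapsed = unique_sample_categories(lipid_runs)
```

With this approach the two runs collapse to a single pair for plotting, without needing a third grouping record.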
I think another option would be to create a separate class for these groups of DataGeneration instances if we feel like they themselves need a class of their own for some reason. I feel pretty strongly that we should not be using the DataGeneration to model anything that does not actually represent an individual instrument run, because then our modeling really becomes shaky.
Happy to talk more.
@aclum this is required for berk, right? Is it MVP?
Depends on the discussion on Wednesday. If we keep the existing modeling I would like this to be part of the MVP even if the ingest code isn't ready to handle it.
@aclum's initial dataset, in YAML for those who prefer it
---
data_generation_set:
  - id: nmdc:dgns-99-zUCd5N
    type: nmdc:NucleotideSequencing
    analyte_category: metagenome
    name: Thawing permafrost microbial communities from the Arctic, studying carbon
      transformations - Permafrost 712P3D -run 1
    has_input:
      - nmdc:bsm-00-red
    has_output:
      - nmdc:dobj-00-9n9n9n
    associated_studies:
      - nmdc:sty-00-555xxx
    part_of:
      - nmdc:dgns-22-444xxx
  - id: nmdc:dgns-99-zUCd5Z
    type: nmdc:NucleotideSequencing
    analyte_category: metagenome
    name: Thawing permafrost microbial communities from the Arctic, studying carbon
      transformations - Permafrost 712P3D -run 2
    has_input:
      - nmdc:bsm-00-red
    has_output:
      - nmdc:dobj-00-9n9n9z
    associated_studies:
      - nmdc:sty-00-555xxx
    part_of:
      - nmdc:dgns-22-444xxx
  - id: nmdc:dgns-22-444xxx
    type: nmdc:NucleotideSequencing
    analyte_category: metagenome
    name: Thawing permafrost microbial communities from the Arctic, studying carbon
      transformations - Permafrost 712P3D
    has_input:
      - nmdc:bsm-00-red
    associated_studies:
      - nmdc:sty-00-555xxx
blocked until we decide if we are going to keep existing modeling @naglepuff
modeling was rolled back, no action until we have new modeling.
One of the modeling changes in berkeley was to support projects that had multiple instrument runs (of the same type) which should be analyzed together. Records are linked with part_of; for the purposes of counts we should count this as only one DataGeneration subclass record. I believe the current logic would count this as three records. Simplified example:
{
  "data_generation_set": [
    {
      "id": "nmdc:dgns-99-zUCd5N",
      "type": "nmdc:NucleotideSequencing",
      "analyte_category": "metagenome",
      "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D -run 1",
      "has_input": ["nmdc:bsm-00-red"],
      "has_output": ["nmdc:dobj-00-9n9n9n"],
      "associated_studies": ["nmdc:sty-00-555xxx"],
      "part_of": ["nmdc:dgns-22-444xxx"]
    },
    {
      "id": "nmdc:dgns-99-zUCd5Z",
      "type": "nmdc:NucleotideSequencing",
      "analyte_category": "metagenome",
      "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D -run 2",
      "has_input": ["nmdc:bsm-00-red"],
      "has_output": ["nmdc:dobj-00-9n9n9z"],
      "associated_studies": ["nmdc:sty-00-555xxx"],
      "part_of": ["nmdc:dgns-22-444xxx"]
    },
    {
      "id": "nmdc:dgns-22-444xxx",
      "type": "nmdc:NucleotideSequencing",
      "analyte_category": "metagenome",
      "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
      "has_input": ["nmdc:bsm-00-red"],
      "associated_studies": ["nmdc:sty-00-555xxx"]
    }
  ]
}
You have to check each record for a parent, because a missing part_of can mean either that the record has no parent or that it is itself the parent record. cc @naglepuff
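A rough sketch of the counting rule described above, not the portal's actual logic: count every record without part_of once (it is either standalone or a parent), skip records that point at a parent, and also count any parent ID that is referenced but missing from the set. The function name and the dangling-parent handling are assumptions made for illustration.

```python
def count_data_generations(records):
    """Count DataGeneration records, collapsing each part_of group to one.

    Assumption: records are dicts with an "id" and an optional "part_of" list,
    as in the simplified example in this thread.
    """
    known_ids = {rec["id"] for rec in records}
    # Records without part_of are standalone records or group parents.
    roots = {rec["id"] for rec in records if not rec.get("part_of")}
    # Parents that are referenced but absent from the set still form a group.
    dangling = {
        parent_id
        for rec in records
        for parent_id in rec.get("part_of") or []
        if parent_id not in known_ids
    }
    return len(roots | dangling)


example = [
    {"id": "nmdc:dgns-99-zUCd5N", "part_of": ["nmdc:dgns-22-444xxx"]},
    {"id": "nmdc:dgns-99-zUCd5Z", "part_of": ["nmdc:dgns-22-444xxx"]},
    {"id": "nmdc:dgns-22-444xxx"},
]

group_count = count_data_generations(example)  # the three records count as one
```

Under this rule the two runs and their parent count as a single record, and a record with no part_of that nothing points to still counts once on its own.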
Desired behavior to be discussed at Monday checkin meeting. cc @SamuelPurvine @kheal @mslarae13