microbiomedata / nmdc-server

Data portal client and server for NMDC.
https://data.microbiomedata.org

`berkeley` update logic for bar and upset charts to account for parthood modeling for DataGeneration #1365

Open aclum opened 2 weeks ago

aclum commented 2 weeks ago

One of the modeling changes in berkeley was to support projects that had multiple instrument runs (of the same type) which should be analyzed together. Records are linked with part_of; for the purposes of counts, we should count such a group as only one DataGeneration subclass record. I believe the current logic would count this as three records. Simplified example:

{
  "data_generation_set": [
    {
      "id": "nmdc:dgns-99-zUCd5N",
      "type": "nmdc:NucleotideSequencing",
      "analyte_category": "metagenome",
      "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D -run 1",
      "has_input": [ "nmdc:bsm-00-red" ],
      "has_output": [ "nmdc:dobj-00-9n9n9n" ],
      "associated_studies": [ "nmdc:sty-00-555xxx" ],
      "part_of": [ "nmdc:dgns-22-444xxx" ]
    },
    {
      "id": "nmdc:dgns-99-zUCd5Z",
      "type": "nmdc:NucleotideSequencing",
      "analyte_category": "metagenome",
      "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D -run 2",
      "has_input": [ "nmdc:bsm-00-red" ],
      "has_output": [ "nmdc:dobj-00-9n9n9z" ],
      "associated_studies": [ "nmdc:sty-00-555xxx" ],
      "part_of": [ "nmdc:dgns-22-444xxx" ]
    },
    {
      "id": "nmdc:dgns-22-444xxx",
      "type": "nmdc:NucleotideSequencing",
      "analyte_category": "metagenome",
      "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
      "has_input": [ "nmdc:bsm-00-red" ],
      "associated_studies": [ "nmdc:sty-00-555xxx" ]
    }
  ]
}

To count correctly, you have to check that each record doesn't have a parent, since a record without part_of is either a standalone record with no parent or the parent record itself. cc @naglepuff
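
A minimal Python sketch of the counting rule described above. The function name `count_data_generations` is hypothetical and not taken from the nmdc-server code; it also assumes a single level of nesting (a parent record never has a part_of of its own):

```python
def count_data_generations(records):
    """Count DataGeneration records, treating each part_of group as one.

    Assumption: one level of nesting only. A child's part_of lists its
    parent(s), and a parent has no part_of of its own.
    """
    roots = set()
    for rec in records:
        parents = rec.get("part_of") or []
        if parents:
            # Child run: count the parent(s) it belongs to, not the run itself.
            roots.update(parents)
        else:
            # Either a standalone record or the parent of a group.
            roots.add(rec["id"])
    return len(roots)


# Trimmed version of the simplified example (only the fields the count needs).
records = [
    {"id": "nmdc:dgns-99-zUCd5N", "part_of": ["nmdc:dgns-22-444xxx"]},
    {"id": "nmdc:dgns-99-zUCd5Z", "part_of": ["nmdc:dgns-22-444xxx"]},
    {"id": "nmdc:dgns-22-444xxx"},
]

naive_count = len(records)                        # 3: every record counted
grouped_count = count_data_generations(records)   # 1: the group counted once
print(naive_count, grouped_count)
```

Collecting parent IDs into a set means the group is still counted once even if the parent record itself is missing from the result set being counted.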

Desired behavior to be discussed at Monday checkin meeting. cc @SamuelPurvine @kheal @mslarae13

SamuelPurvine commented 2 weeks ago

Interesting! So, currently the proteomics mass spec data is one-to-one for omprc to mass spec data object, and there's migrations planned but I'm unsure what those migrations will end up looking like. In the above paradigm we will make one new DataGeneration class entry and associate a "bunch" of MassSpectrometry instances to that, where the MassSpectrometry carries with it the 1) has_input (Biosample or processed sample), 2) has_output (DataObject with the related raw file stuff) and 3) the part_of which will reference the DataGeneration entry, yah? I've not looked at the migrator to see if that's how it will work, nor if we have the biosample or processed sample IDs to match up to the has_input.

Anyway, @pdpiehowski, this structure would be highly useful in grouping mass spec proteomics data (workflow outputs) into a single "experiment" (loosely defined here as a collection of sample-derived data that should be grouped to allow biological interpretations to occur) for follow-on aggregated analyses. It would also make a natural hurdle for grouping MassSpectrometry entries that shouldn't be grouped (different instruments, different collection times, different institutions, etc.), where care must be taken when doing so.

kheal commented 2 weeks ago

@aclum, thank you for tagging me. This modeling is really important for lipidomics and metabolomics analyses (for similar reasons to what @SamuelPurvine mentions) and I didn't know about it! Our current lipid analysis is 1:1, but I have a batch analysis pipeline in the works that would rely on this part_of connection that should (in theory) result in better annotation coverage, so I'll be sure to populate that slot when the time comes.

I don't actually understand the issue itself, but if you think I can be of use, please forward the Monday meeting you speak of to me and I'll try to contribute where I can.

pdpiehowski commented 2 weeks ago

Great, I can't say that I fully understand either, but it sounds like we can save some major headaches down the road.

aclum commented 2 weeks ago

The Monday meeting is the rollout lead meeting, so hopefully you'll be there @kheal

kheal commented 2 weeks ago

@aclum

Ok, I think I understand this more. I don't think your third entry should be modeled as a DataGeneration (which, by definition is "A DataGeneration in which the sequence of DNA or RNA molecules is generated"). Instead, this represents a group of DataGeneration instances, correct? It is impossible to tell that from the record itself and I find that problematic.

Could we use part_of so the daughters reference each other, eliminating the need for a third, "shadow" instance? That would eliminate the logic gymnastics for the data portal, and each DataGeneration instance would actually represent an instance of generating data. I do not love the name of the slot and I'd be open to changing it, but that is how I would use this slot for MassSpectrometry instances: the concept of parent/children doesn't translate to MassSpectrometry the way you have modeled and explained it. Depending on our desired plotting on the upset plots, we can collapse these by looking for unique has_input:analyte_category, which would work for lipids too (which will have 2 DataGeneration instances for each Biosample, but they are not combined like the sequencing data are).
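
The has_input:analyte_category collapse could look roughly like this in Python. This is only a sketch: the function name `count_unique_input_analyte`, the lipid record IDs, and the `lipidome` category value are illustrative assumptions, not taken from the portal code or real data:

```python
def count_unique_input_analyte(records):
    """Count distinct (has_input, analyte_category) pairs across records."""
    pairs = set()
    for rec in records:
        category = rec.get("analyte_category")
        for sample_id in rec.get("has_input", []):
            # Several runs on the same sample and category collapse to one pair.
            pairs.add((sample_id, category))
    return len(pairs)


# The three sequencing records from the example, trimmed to the relevant slots.
sequencing = [
    {"id": "nmdc:dgns-99-zUCd5N", "analyte_category": "metagenome",
     "has_input": ["nmdc:bsm-00-red"]},
    {"id": "nmdc:dgns-99-zUCd5Z", "analyte_category": "metagenome",
     "has_input": ["nmdc:bsm-00-red"]},
    {"id": "nmdc:dgns-22-444xxx", "analyte_category": "metagenome",
     "has_input": ["nmdc:bsm-00-red"]},
]

# Hypothetical lipid records: two runs for the same Biosample (IDs made up).
lipids = [
    {"id": "nmdc:dgms-00-lipidA", "analyte_category": "lipidome",
     "has_input": ["nmdc:bsm-00-red"]},
    {"id": "nmdc:dgms-00-lipidB", "analyte_category": "lipidome",
     "has_input": ["nmdc:bsm-00-red"]},
]

print(count_unique_input_analyte(sequencing))           # 1
print(count_unique_input_analyte(lipids))               # 1
print(count_unique_input_analyte(sequencing + lipids))  # 2
```

This approach does not depend on part_of at all, so it gives the same count whether or not a "shadow" record exists.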

I think another option would be to create a separate class for these groups of DataGeneration instances if we feel like they themselves need a class of their own for some reason. I feel pretty strongly that we should not be using the DataGeneration to model anything that does not actually represent an individual instrument run, because then our modeling really becomes shaky.

Happy to talk more.

mslarae13 commented 2 weeks ago

@aclum this is required for berk, right? Is it MVP?

aclum commented 2 weeks ago

Depends on the discussion on Wednesday. If we keep the existing modeling I would like this to be part of the MVP even if the ingest code isn't ready to handle it.

turbomam commented 2 weeks ago

@aclum's initial dataset, in YAML for those who prefer it:

---
data_generation_set:
- id: nmdc:dgns-99-zUCd5N
  type: nmdc:NucleotideSequencing
  analyte_category: metagenome
  name: Thawing permafrost microbial communities from the Arctic, studying carbon
    transformations - Permafrost 712P3D -run 1
  has_input:
  - nmdc:bsm-00-red
  has_output:
  - nmdc:dobj-00-9n9n9n
  associated_studies:
  - nmdc:sty-00-555xxx
  part_of:
  - nmdc:dgns-22-444xxx
- id: nmdc:dgns-99-zUCd5Z
  type: nmdc:NucleotideSequencing
  analyte_category: metagenome
  name: Thawing permafrost microbial communities from the Arctic, studying carbon
    transformations - Permafrost 712P3D -run 2
  has_input:
  - nmdc:bsm-00-red
  has_output:
  - nmdc:dobj-00-9n9n9z
  associated_studies:
  - nmdc:sty-00-555xxx
  part_of:
  - nmdc:dgns-22-444xxx
- id: nmdc:dgns-22-444xxx
  type: nmdc:NucleotideSequencing
  analyte_category: metagenome
  name: Thawing permafrost microbial communities from the Arctic, studying carbon
    transformations - Permafrost 712P3D
  has_input:
  - nmdc:bsm-00-red
  associated_studies:
  - nmdc:sty-00-555xxx

turbomam commented 2 weeks ago

related slides: 20240826 Status of modeling when instrument runs need to be combined

aclum commented 2 weeks ago

Blocked until we decide whether we are going to keep the existing modeling. @naglepuff

aclum commented 21 hours ago

Modeling was rolled back; no action until we have new modeling.