the nmdc-schema may contain some `Biosample` slots that belong exclusively to the submission-schema

turbomam commented 7 months ago

For example, rna_volume is defined in nmdc-schema and associated with Biosample. We even have some tests for it.

If it is important to capture that information in the SubmissionPortal, so that it can be passed on to user facilities, but it is never going to be saved into MongoDB, maybe it doesn't belong in nmdc-schema?

Right now, I don't think we have a way to introduce slots into submission-schema other than by extraction from nmdc-schema, but that doesn't seem like a big technical challenge.
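To make the idea concrete, here is a minimal sketch of splitting a class's slot list into slots that stay in nmdc-schema and slots that would live only in submission-schema. The function name `split_slots` and the sample lists are assumptions for illustration; this is not how either repository currently builds its schema, and the `biosample_slots` list is only a short excerpt.

```python
from typing import Iterable


def split_slots(
    class_slots: Iterable[str], submission_only: set[str]
) -> tuple[list[str], list[str]]:
    """Partition slot names into (kept in nmdc-schema, moved to submission-schema)."""
    kept: list[str] = []
    moved: list[str] = []
    for name in class_slots:
        # A slot is moved only if it was flagged as submission-only.
        (moved if name in submission_only else kept).append(name)
    return kept, moved


# Hypothetical excerpt of Biosample's slots; NOT the real, complete list.
biosample_slots = ["id", "env_broad_scale", "rna_volume", "dna_cont_well"]
# Slots assumed to be needed only by the submission portal.
submission_only = {"rna_volume", "dna_cont_well", "dna_dnase"}

kept, moved = split_slots(biosample_slots, submission_only)
print("stay in nmdc-schema:", kept)
print("move to submission-schema:", moved)
```

Names listed in `submission_only` that are not on the class (here `dna_dnase`) are simply ignored, so the same set could be reused against several classes.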

See this example data file, which I revised in a currently unmerged PR.

mslarae13 commented 5 months ago

Below is a list of slots that are in nmdc-schema but are only needed in the submission portal. These slots need to be stored on the submission portal side and be exportable for the JGI metadata template, but they do not need to be captured by NMDC.

| Slot | Defined in | Facility | Note |
| -- | -- | -- | -- |
| emsl_store_temp | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/emsl.yaml | EMSL | |
| sample_shipped | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/emsl.yaml | EMSL | |
| sample_type | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/emsl.yaml | EMSL | |
| dna_absorb1 | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | If we keep it, it should map to processed sample, not biosamples. |
| dna_absorb2 | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | If we keep it, it should map to processed sample, not biosamples. |
| dna_collect_site | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | I think we can remove this slot and just put the extension that is selected (so 'soil'), but a JGI team member should weigh in. |
| dna_cont_type | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | |
| dna_cont_well | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | |
| dna_container_id | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | |
| dna_dnase | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | |
| dna_organisms | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | Remove completely? |
| dna_project_contact | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | Will need for JGI API. |
| dna_samp_id | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | This is effectively an alternative identifier for a processed sample which is the output of class Extraction. JGI doesn't really expose these identifiers externally, so low priority to keep. Would be worth keeping in the submission portal Postgres in case we ever need to pull this in. |
| dna_sample_format | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | Not informative for workflows. |
| dna_sample_name | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | Would map to processed sample if we do decide to keep it. |
| dna_seq_project_pi | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | Will need for JGI API; principal_investigator will be available on OmicsProcessing/DataGeneration. |
| dna_seq_project_name | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | Will pull from JGI API. |
| dna_volume | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | |
| proposal_dna | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metagenomics.yaml | JGI | Will pull from JGI API. |
| dna_concentration | https://github.com/microbiomedata/nmdc-schema/blob/df7bd5e6a31e5c46b9afcd844a66d251e612b2b1/src/schema/basic_slots.yaml#L84 | JGI | Would map to processed sample if we do decide to keep it. Why is it in basic_slots and not jgi_metagenomics.yaml? |
| proposal_rna | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Will pull from JGI API. |
| rna_absorb1 | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | If we keep it, it should map to processed sample, not biosamples. |
| rna_absorb2 | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | If we keep it, it should map to processed sample, not biosamples. |
| rna_collect_site | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | I think we can remove this slot and just put the extension that is selected (so 'soil'), but a JGI team member should weigh in. |
| rna_concentration | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Would map to processed sample if we do decide to keep it. Why is it in basic_slots and not jgi_metagenomics.yaml? |
| rna_cont_type | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | |
| rna_cont_well | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | |
| rna_container_id | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | |
| rna_organisms | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Remove completely? |
| rna_project_contact | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Will need for JGI API. |
| rna_samp_id | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | This is effectively an alternative identifier for a processed sample which is the output of class Extraction. JGI doesn't really expose these identifiers externally, so low priority to keep. Would be worth keeping in the submission portal Postgres in case we ever need to pull this in. |
| rna_sample_format | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Not informative for workflows. |
| rna_sample_name | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Would map to processed sample if we do decide to keep it. |
| rna_seq_project_pi | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Will need for JGI API; principal_investigator will be available on OmicsProcessing/DataGeneration. |
| rna_seq_project_name | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | Will pull from JGI API. |
| rna_volume | https://github.com/microbiomedata/nmdc-schema/blob/main/src/schema/portal/jgi_metatranscriptomics.yaml | JGI | |

mslarae13 commented 5 months ago

@pkalita-lbl @turbomam See the table above.

Next steps I propose for these JGI and EMSL slots:

1. Remove from the linked .yaml files
2. Add to NMDC submission schema. @pkalita-lbl Can you add some direction on how I can do this (what file and where to add/move these slots to)?
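For the removal step, it may help to group the slots by the file they are defined in, so each .yaml module can be edited in one pass. The sketch below uses a handful of (slot, file) pairs copied from the table; the helper name `group_by_file` is hypothetical, and nothing here touches the actual schema files.

```python
from collections import defaultdict

# A few (slot, source file) pairs taken from the table in this thread.
rows = [
    ("emsl_store_temp", "src/schema/portal/emsl.yaml"),
    ("dna_absorb1", "src/schema/portal/jgi_metagenomics.yaml"),
    ("dna_volume", "src/schema/portal/jgi_metagenomics.yaml"),
    ("rna_volume", "src/schema/portal/jgi_metatranscriptomics.yaml"),
]


def group_by_file(pairs):
    """Return {source file: [slot names]} preserving the input order."""
    grouped = defaultdict(list)
    for slot, path in pairs:
        grouped[path].append(slot)
    return dict(grouped)


removals = group_by_file(rows)
for path, slots in removals.items():
    print(path, "->", ", ".join(slots))
```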

pkalita-lbl commented 5 months ago

The slots would need to be defined in this TSV file: https://github.com/microbiomedata/submission-schema/blob/main/schemasheets/tsv_in/slots.tsv

The association of slots to classes happens in this TSV file: https://github.com/microbiomedata/submission-schema/blob/main/schemasheets/tsv_in/classes.tsv. You can see that a lot of the slots in question are already associated with the appropriate Interface classes. Those would just need to be reviewed to make sure they're still accurate.
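One way to do that review is to read the class-to-slot associations and report which expected slots are not yet attached to a given class. This is only a sketch: the two-column layout (`class`, `slot`), the class names, and the helper names are assumptions for illustration, and the real classes.tsv may use different headers and extra descriptor rows.

```python
import csv
import io

# Hypothetical stand-in for classes.tsv; column and class names are assumed.
CLASSES_TSV = "\n".join(
    [
        "class\tslot",
        "JgiMgInterface\tdna_volume",
        "JgiMgInterface\tdna_cont_well",
        "JgiMtInterface\trna_volume",
    ]
)


def slots_by_class(tsv_text):
    """Parse tab-separated text into {class name: set of slot names}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        result.setdefault(row["class"], set()).add(row["slot"])
    return result


def missing_slots(tsv_text, class_name, expected):
    """Return the expected slots not yet associated with class_name, sorted."""
    have = slots_by_class(tsv_text).get(class_name, set())
    return sorted(set(expected) - have)


print(missing_slots(CLASSES_TSV, "JgiMgInterface", ["dna_volume", "dna_dnase"]))
```

With the sample data, `dna_dnase` is reported as missing because it has no row for that class; an unknown class name reports every expected slot as missing.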

mslarae13 commented 5 months ago

From slack

@turbomam "If some UF slots are going to be moved to submission-schema in the near future, then I would prefer not to move them around within the nmdc-schema modules. I think Alicia and I moved some slots because she was using them outside of their original UF use-case, and that kept the schema from building. We should have included you in that decision. I'm working on an issue to make all modules self-sufficient (i.e. build on their own), and this will address the question you have asked along the lines of "how do I know which module to put new content in?" Can we leave the UF slots where they are until then?"

In summary: for the user facility slots we want to keep in nmdc-schema, I'll add aliases. In a later task we'll complete this issue by removing from nmdc-schema the user facility slots that NMDC does NOT need to track, so that they exist only in the NMDC submission portal. We will also decide later whether the slots that we DO capture should remain in separate .yaml files or be moved to basic_slots.

ssarrafan commented 5 months ago

@mslarae13 @pkalita-lbl is this an active issue? I'm going to remove this from this sprint and add to the backlog but if it's active please add to a future sprint.

mslarae13 commented 4 weeks ago

microbiomedata / nmdc-schema

the nmdc-schema may contain some `Biosample` slots that belong exclusively to the submission-schema #1454