microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Expand NMDC to support GLBRC #1110

Open mslarae13 opened 9 months ago

mslarae13 commented 9 months ago

In working with Adina, we've identified some metadata slots that should be added to the NMDC schema to support this project.

Some of the slots below are already in the NMDC schema, but may not be in the plant-associated package. Need to confirm.

In NMDC, add alias

In NMDC, but measurement was on the SOIL. NOT Plant

not in NMDC schema

Terms that will be added to Biosample, but should discuss putting on Site

phenotypic evaluation of plants?

mslarae13 commented 9 months ago

@turbomam mixs.yaml has no separation of which environmental extensions are associated with which slots, correct?

That's only done at the submissions schema, correct?

turbomam commented 9 months ago

Well, there are a couple of mixs.yaml files. The one in the nmdc-schema repo does not make that connection. It's just a collection of terms, which are associated with the monolithic Biosample class and the OmicsProcessing class in nmdc.yaml.

But the previous collection of MIxS YAML files does.

And so does the 6.2 release candidate.

But you'll see that those two LinkML versions do it in slightly different ways. We could make all of that more transparent if necessary.

turbomam commented 9 months ago

And yes, the submission schema does it too.

mslarae13 commented 9 months ago

@turbomam why are some descriptions in single quotes (' X ') but some aren't? See tot_phosp.

mslarae13 commented 9 months ago

why are some descriptions in single quotes but some aren't?

If a phrase has a colon (:) in it, Mark writes it in single quotes, because an unquoted : can be misread as YAML syntax.

We need to decide: when are single quotes required? Should we just use them all the time? Putting single quotes around every description is the safer option.
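
To illustrate the quoting question, here is a minimal YAML sketch. The slot names `example_quoted` and `example_plain` and both descriptions are hypothetical and are not taken from the schema.

```yaml
slots:
  # Hypothetical slot: the description contains ": " (colon + space),
  # so it is wrapped in single quotes to keep it a single string.
  example_quoted:
    description: 'Total of two parts, calculated by: part A + part B'

  # Hypothetical slot: no colon in the text, so the plain (unquoted)
  # form and the quoted form parse to the same string.
  example_plain:
    description: Total of two parts reported as a single value

  # For contrast, the unquoted form of the first description would
  # typically be rejected by a YAML parser, because "by: part A"
  # looks like the start of a nested mapping:
  #   description: Total of two parts, calculated by: part A + part B
```

If quoting everything is adopted as the rule, note that an apostrophe inside a single-quoted YAML string has to be doubled (`''`).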

bmeluch commented 8 months ago

Checked the descriptions of all the "In NMDC schema, add to plant-associated" terms for sample type exclusivity. Only "tot_nitro" would exclude plants:

name: tot_nitro
description: 'Total nitrogen concentration of water samples, calculated by: total
  nitrogen = total dissolved nitrogen + particulate nitrogen. Can also be measured
  without filtering, reported as nitrogen'
domain_of:
- HydrocarbonResourcesCores
- HydrocarbonResourcesFluidsSwabs
- WastewaterSludge
- Water

turbomam commented 8 months ago

@bmeluch @mslarae13 and friends: how would you feel about

We could continue to use GlbrcSample as a mixin, or migrate the slots onto Biosample. But I hope our efforts in December will leave us with a smaller, more modular Biosample overall.
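
As a rough sketch of the mixin option mentioned above, assuming standard LinkML `mixin` / `mixins` syntax: `GlbrcSample` and `Biosample` come from this thread, while `example_glbrc_slot` and the descriptions are hypothetical placeholders.

```yaml
classes:
  GlbrcSample:
    mixin: true
    description: Slots contributed for the GLBRC project (placeholder text).
    slots:
      - example_glbrc_slot   # hypothetical slot name

  Biosample:
    # Biosample picks up every slot listed on the mixin class.
    mixins:
      - GlbrcSample

slots:
  example_glbrc_slot:
    description: Placeholder for a GLBRC-specific measurement.
```

Migrating the slots onto Biosample instead would mean dropping the `mixins` entry and listing the same slot names directly under `Biosample.slots`.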

mslarae13 commented 8 months ago

@turbomam are you suggesting a GLBRC YAML file that the NMDC schema references for these, like we do with MIxS, EMSL, and JGI?

mslarae13 commented 8 months ago

@turbomam how would you like aliases added to existing slots? They don't need to be CURIEs, right? But I want to be able to attribute that this is what the term is called in GLBRC.

turbomam commented 8 months ago

Yes, I am suggesting that strategy. In the cases you mentioned, we define slots in those separate YAML modules and then assign them to Biosample in nmdc.yaml.
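
A minimal sketch of that module strategy, assuming standard LinkML `imports`. The file name `glbrc.yaml`, the `id` values, and `example_glbrc_slot` are hypothetical, and both files are shown as fragments rather than complete schemas.

```yaml
# ---- glbrc.yaml (hypothetical module): defines the slots only ----
id: https://example.org/glbrc      # placeholder id
name: glbrc
slots:
  example_glbrc_slot:              # hypothetical slot name
    description: Placeholder for a GLBRC-specific measurement.
---
# ---- nmdc.yaml (fragment): imports the module, assigns the slot ----
id: https://example.org/nmdc       # placeholder id
name: nmdc
imports:
  - glbrc                          # resolved relative to this file
classes:
  Biosample:
    slots:
      - example_glbrc_slot
```

The slot definition lives in one thematic file, and the decision about which class uses it stays in nmdc.yaml.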

But I have raised the general question to our team: do we want more or fewer YAML modules? At one point, @mslarae13, I think you found the multiple modules difficult to search through. Have you become more comfortable with "find-in-files" in PyCharm or some other tool? I think the PyCharm functionality can be accessed with shift-command-f. @mbthornton-lbl has confirmed that he finds the practice of separate, thematic modules helpful.

I am really concerned about the number of slots we are adding onto Biosample. @mbthornton-lbl has opened an issue to propose a refactoring. I don't know if we can or should work on that before, during or after our December meeting in Berkeley.

turbomam commented 8 months ago

Aliases can be assigned to a schema element with attribution with
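
The reference in the comment above is not preserved here. As an assumption only, one LinkML construct that could carry an alias together with its attribution is `structured_aliases`; the slot name, alias text, and URI below are hypothetical placeholders.

```yaml
slots:
  example_glbrc_slot:                  # hypothetical slot name
    # Plain aliases: a list of strings with no attribution.
    aliases:
      - example GLBRC term             # hypothetical alias text
    # Structured aliases: each alias carries its own metadata.
    structured_aliases:
      - literal_form: example GLBRC term   # hypothetical alias text
        predicate: EXACT_SYNONYM
        contexts:
          - https://example.org/glbrc      # placeholder URI for the source
```

Neither form requires the alias text itself to be a CURIE. Only the attribution (`contexts` here) is a URI.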

mslarae13 commented 8 months ago

I'm ok with separate organized files. I can figure it out. I just want to make sure we're making the right decision for the right reasons. We'll make a GLBRC yaml to capture the new slots & can merge it in depending on the decision.

mslarae13 commented 8 months ago

Will the BRCs capture this information consistently? Are these slots specific to this study? Do we make mappings and aliases, or provide a tool for converting their metadata to the term NMDC would use? For close mappings and more specific mappings, a CURIE is required. Can we discuss this at the next GLBRC meeting, after seeing their DH implementation, model, and schema?

mslarae13 commented 8 months ago

For now, pull in the submission and pause on mapping and additional/new metadata fields until we meet. Ingest the mapped metadata and skip schema mapping for now until . Provide a spreadsheet with one column for the GLBRC term and one for the schema term.