microbiomedata / issues

public repo for issues related to NMDC work

GOLD ecosystem in submissions vs in GOLD #739

Open mslarae13 opened 1 week ago

mslarae13 commented 1 week ago

Adina completed metadata via the submission portal for GLBRC. See the study here: https://data.microbiomedata.org/details/study/nmdc:sty-11-e4yb9z58

I am pretty sure this is the corresponding GOLD study: https://gold.jgi.doe.gov/study?id=Gs0128851

Adina provided a DOE award but, on the data tab of the submission, did not indicate that the metagenome data was generated at JGI.

In GOLD, the ecosystem path for the biosamples is (spot-checked, I did not look at all of them):

Ecosystem: Host-associated
Ecosystem Category: Plants
Ecosystem Type: Phyllosphere
Ecosystem Subtype: Phylloplane/Leaf
Specific Ecosystem: Unclassified

In the NMDC submission portal, the path provided is Environmental | Terrestrial | Plant-associated | Leaf | Phyllosphere
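
To make the difference concrete, here is a minimal Python sketch that lines the two paths up level by level. It assumes the five submitted terms map onto the five GOLD levels in order (the thread does not state this explicitly), and `diff_paths` is a hypothetical helper, not part of any NMDC or GOLD tooling.

```python
# Level names as given in the GOLD listing in this comment.
LEVELS = [
    "Ecosystem",
    "Ecosystem Category",
    "Ecosystem Type",
    "Ecosystem Subtype",
    "Specific Ecosystem",
]

# Path recorded in GOLD for the spot-checked biosamples.
gold_path = ["Host-associated", "Plants", "Phyllosphere", "Phylloplane/Leaf", "Unclassified"]

# Path from the submission portal.
# Assumption: the terms correspond to the five levels in the order listed.
submitted_path = ["Environmental", "Terrestrial", "Plant-associated", "Leaf", "Phyllosphere"]


def diff_paths(gold, submitted):
    """Return (level, gold_value, submitted_value) for every level that differs."""
    return [
        (level, g, s)
        for level, g, s in zip(LEVELS, gold, submitted)
        if g != s
    ]


mismatches = diff_paths(gold_path, submitted_path)
for level, g, s in mismatches:
    print(f"{level}: GOLD={g!r} submitted={s!r}")
```

Under that ordering assumption, every one of the five levels differs, so this is not a case of a single mistyped term.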

Two problems:

  1. They don't match. Which one do we use?
  2. Plant-associated | Leaf | Phyllosphere isn't valid. I'm not sure how it passed validation and got submitted!

@pkalita-lbl can you help figure out why this was valid and isn't now? I am guessing GOLD updated its paths and we pulled the updated paths, because I'm looking at https://gold.jgi.doe.gov/ecosystemtree and these are in fact not there. But I don't know how Adina could have submitted if it wasn't valid before.

@aclum @sujaypatil96 can you check out GOLD and let me know if I am correct on the GOLD study? @aclum can you chime in on what we should use? In the past, @emileyfadrosh has commented that we should use what the user says, as it's their samples and their classification. But now it's not valid!

pkalita-lbl commented 1 week ago

My notes suggest that we brought that submission into Mongo in November of 2023.

The changes to keep the submission schema in sync with GOLD ecosystem classification terms were done in January 2024. Those changes also updated the range for the 5 GOLD ecosystem classification slots. Prior to those changes, the range in anything other than the soil template (and the submission in question uses the plant-associated template) was string.

So yeah, it makes sense to me that the submission was valid in November 2023 but is not now. And it's because of improvements to our schema, not because of changes at GOLD.
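
A rough Python sketch of the difference between the two ranges, for anyone following along. This is not the actual schema validation code: `is_valid` is a hypothetical helper, and the permissible-value set is a deliberately incomplete stand-in containing only the ecosystem type mentioned in this thread.

```python
from typing import Optional, Set

# Illustrative stand-in for an enumerated range on the ecosystem type slot.
# Assumption: the real list of permissible values is much longer; only the
# value quoted from GOLD in this issue is included here.
PERMISSIBLE_ECOSYSTEM_TYPES = {"Phyllosphere"}


def is_valid(value: str, permissible: Optional[Set[str]]) -> bool:
    """Check one slot value.

    permissible=None models a plain string range (any string passes);
    a set models an enumerated range (only listed values pass).
    """
    if permissible is None:
        return isinstance(value, str)
    return value in permissible


submitted_type = "Plant-associated"

# Before the January 2024 change: range was string, so free text was accepted.
print(is_valid(submitted_type, None))

# After the change: range is enumerated, so the same value is rejected.
print(is_valid(submitted_type, PERMISSIBLE_ECOSYSTEM_TYPES))
```

The same value flips from accepted to rejected purely because the range changed, which matches the timeline above.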

mslarae13 commented 1 week ago

That makes sense! So in light of that, @aclum, I suggest we submit a change-sheet to make the NMDC GOLD paths match GOLD, since that's what is actually valid.

aclum commented 5 days ago

Yes, this is the correct GOLD study ID. You can check this by getting the site award number from the OSTI page (https://www.osti.gov/award-doi-service/biblio/10.46936/10.25585/60000818) and using that ID to query the GOLD studies endpoint. You can do this interactively from the Swagger UI if you are logged in with your ORCID. The JSON document that it returns has the GOLD study ID value in the field 'studyGoldId'.
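
If you save the response to a file or paste it into a script, pulling the ID out could look roughly like the sketch below. The sample payload is made up for illustration and only contains the 'studyGoldId' field; the real response has more fields, and whether it is a single object or a list is an assumption here, so the hypothetical helper `extract_study_gold_ids` accepts both.

```python
import json


def extract_study_gold_ids(payload: str) -> list:
    """Collect every 'studyGoldId' value from a JSON response body.

    Assumption: the body is either one JSON object or a list of objects.
    """
    data = json.loads(payload)
    records = data if isinstance(data, list) else [data]
    return [
        record["studyGoldId"]
        for record in records
        if isinstance(record, dict) and "studyGoldId" in record
    ]


# Made-up sample body, reduced to the one field discussed in this thread.
sample_response = '[{"studyGoldId": "Gs0128851"}]'

print(extract_study_gold_ids(sample_response))
```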

Do we have a mapping file between NMDC IDs and GOLD IDs? If so, we can use the existing GOLD ingest code to update the GOLD fields, add the bioproject and sample information, and create the omics records so we can pull the data in.

mslarae13 commented 4 days ago

Decision: use the GOLD paths that are in GOLD, not the ones the user provided. No longer blocked.

I am pretty sure the GOLD biosample IDs are stored. So yes? @sujaypatil96 can you help with this?