microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Use MIxS `samp_taxon_id` to model NCBI taxonomy ID from GOLD? #574

Closed sujaypatil96 closed 1 year ago

sujaypatil96 commented 1 year ago

Add a slot called ncbi_taxon_id to capture the host taxonomy ID as present in GOLD, and possibly in other sources.

For example, let's look at Gb0291745 on the GOLD website: https://gold.jgi.doe.gov/biosample?id=Gb0291745

Look under the Host Metadata section, and you'll see a value of 3689 associated with the host taxonomy ID.

We need a slot under the Biosample class to capture this information. In this issue, I'm proposing the addition of a general slot called ncbi_taxon_id, and asserting its usage in the Biosample class via its slots property.

sujaypatil96 commented 1 year ago

Comments from discussion with @turbomam:

There is a slot in MIxS called specific_host already, which models the taxonomy name or id.

We have two approaches here:

ssarrafan commented 1 year ago

@sujaypatil96 is this something you're currently working on? Can I add it to the sprint board for January?

sujaypatil96 commented 1 year ago

Model this similarly to ENVO terms. Follow up with GSC to figure out why there are three similar terms for taxon ID.

sujaypatil96 commented 1 year ago

Following up on this, the three MIxS terms with confusing definitions are:

ssarrafan commented 1 year ago

@sujaypatil96 I'll move this to the next sprint but let me know if you won't be actively working on it for the next couple of weeks

sujaypatil96 commented 1 year ago

@ssarrafan yup, we plan to address this at the metadata call today.

sujaypatil96 commented 1 year ago

The information that we intend to capture in the schema from GOLD is the NCBI taxonomy ID (this is the label of the field as it appears on the JGI GOLD website).

For example, below is a snippet of the output from the GOLD API:

{'biosampleGoldId': 'Gb0291653',
 'biosampleName': 'Bulk soil microbial communities from poplar common garden site in Clatskanie, Oregon, USA - BESC-86-CL2_69_17',
 'ncbiTaxId': 410658, 'ncbiTaxName': 'soil metagenome',
 'sampleCollectionSite': 'Bulk Soil', 'geographicLocation': 'USA: Oregon'...}

Semantic considerations

"soil metagenome" is not a host, so neither of the MIxS terms with the word "host" is applicable. That leaves only samp_taxon_id from the original three candidates (specific_host, host_taxid and samp_taxon_id).

Format considerations

Looking at the ncbiTaxId field, it seems to be an integer value. And here is an example value for the samp_taxon_id field that MIxS provides:

Gut Metagenome [NCBI:txid749906]

We could reconstruct a value that looks like the above example from two GOLD fields: ncbiTaxName and ncbiTaxId. Instead, we propose replacing the syntax implied by the MIxS example with a CURIE such as NCBITaxon:410658 (the form found on OLS).
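As a rough illustration of the proposed syntax, here is a minimal sketch that derives the CURIE from a GOLD API record. The function name gold_record_to_taxon_curie is hypothetical and not part of nmdc-schema or any GOLD client. The record keys are taken from the API snippet above, and the sketch assumes ncbiTaxId is always present and integer-like.

```python
def gold_record_to_taxon_curie(record: dict) -> str:
    """Return an NCBITaxon CURIE for a GOLD biosample record.

    Hypothetical helper: assumes the record carries an integer-like
    'ncbiTaxId' key, as in the GOLD API snippet quoted in this issue.
    """
    tax_id = int(record["ncbiTaxId"])  # raises if the key is missing or non-numeric
    return f"NCBITaxon:{tax_id}"


# Trimmed copy of the record shown above
record = {
    "biosampleGoldId": "Gb0291653",
    "ncbiTaxId": 410658,
    "ncbiTaxName": "soil metagenome",
}
print(gold_record_to_taxon_curie(record))  # NCBITaxon:410658
```

Casting through int() is a deliberate choice here so that a string such as "410658" yields the same CURIE as the integer form.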

mslarae13 commented 1 year ago
  • samp_taxon_id

Agree, samp_taxon_id is the right slot. This is NOT collected from the user or in the submission portal, but I think that's fine. It can just be a GOLD-assigned field.

These descriptions are more or less forward and reverse, so how they're different is unclear, but I think this is a different issue.

Decision during Wednesday 1pm metadata meeting

Format: change ncbiTaxName to NCBITaxon:####

Capture from GOLD: "ncbiTaxName [NCBITaxon:ncbiTaxID]", for example gut metagenome [NCBITaxon:749906] or soil metagenome [NCBITaxon:######]
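To make the agreed format concrete, here is a minimal sketch that builds and parses such a value. The regular expression and both function names are assumptions made for illustration only. They are not patterns defined by MIxS or nmdc-schema, and they assume the label is free text followed by a space and a bracketed, purely numeric NCBITaxon ID.

```python
import re

# Assumed shape: "<label> [NCBITaxon:<digits>]" (illustrative, not from the schema)
SAMP_TAXON_ID_PATTERN = re.compile(r"^(?P<label>.+?) \[NCBITaxon:(?P<id>\d+)\]$")


def format_samp_taxon_id(tax_name: str, tax_id: int) -> str:
    """Combine GOLD's ncbiTaxName and ncbiTaxId into one value."""
    return f"{tax_name} [NCBITaxon:{int(tax_id)}]"


def parse_samp_taxon_id(value: str) -> tuple[str, int]:
    """Split a value back into (label, numeric taxon ID), or raise ValueError."""
    match = SAMP_TAXON_ID_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not in 'label [NCBITaxon:id]' form: {value!r}")
    return match.group("label"), int(match.group("id"))


print(format_samp_taxon_id("soil metagenome", 410658))
print(parse_samp_taxon_id("gut metagenome [NCBITaxon:749906]"))
```

A placeholder such as soil metagenome [NCBITaxon:######] would be rejected by this pattern, since it requires digits inside the brackets.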

turbomam commented 1 year ago

Thanks for the notes, @mslarae13

A couple of questions:

  1. You're suggesting that there wouldn't be any column for samp_taxon_id in the Submission Portal. How would that relate to our vision of gathering sample metadata in bulk from external sources (like GOLD) and then loading it into the submission portal for a "data wrangler" to check?
  2. What does "descriptions are more or less forward and reverse" mean?
  3. Could the last two lines in your comment be summarized as below?

See regexr.com for experimenting with those patterns