Closed sujaypatil96 closed 1 year ago
Comments from discussion with @turbomam:
There is a slot in MIxS called specific_host already, which models the taxonomy name or id.
We have two approaches here:
- In the mixs.yaml file in the nmdc-schema repo, add the specific_host slot, import that into the Biosample class in nmdc.yaml, and constrain specific_host under the slot_usage for Biosample.
- In nmdc.yaml, add a slot called ncbi_taxon_id which is inherited from specific_host using the is_a property, constrain the slot, and then assert it on the Biosample class.
@sujaypatil96 is this something you're currently working on? Can I add it to the sprint board for January?
Model this similarly to ENVO terms. Follow up with GSC to figure out why there are three similar terms for taxon id.
Following up on this, the three MIxS terms with confusing definitions are:
@sujaypatil96 I'll move this to the next sprint but let me know if you won't be actively working on it for the next couple of weeks
@ssarrafan yup, we plan to address this at the metadata call today.
The information that we intend to capture in the schema from GOLD is the NCBI taxonomy ID (this is the label for the field that appears on the JGI GOLD website)
For example, below is a snippet of the output from the GOLD API:
{'biosampleGoldId': 'Gb0291653', 'biosampleName': 'Bulk soil microbial communities from poplar common garden site in
Clatskanie, Oregon, USA - BESC-86-CL2_69_17', 'ncbiTaxId': 410658, 'ncbiTaxName': 'soil metagenome',
'sampleCollectionSite': 'Bulk Soil', 'geographicLocation': 'USA: Oregon'...}
ncbiTaxId is the NCBI Taxonomy ID.
"soil metagenome" is not a host, so neither of the MIxS terms with the word "host" is applicable, leaving only the samp_taxon_id term from the original three candidates (specific_host, host_taxid, and samp_taxon_id).
Looking at the ncbiTaxId field, it seems to be an integer value. And here is an example value for the samp_taxon_id field that MIxS provides:
Gut Metagenome [NCBI:txid749906]
We could reconstruct a value that looks like the above example using two GOLD fields: ncbiTaxName and ncbiTaxId. We propose replacing the syntax implied by the MIxS example above with syntax like NCBITaxon:410658 (as found on OLS).
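As a minimal sketch of the reconstruction described above, the Python below composes a samp_taxon_id-style value from the two GOLD fields. The function name compose_samp_taxon_id is hypothetical, the record is a trimmed copy of the GOLD API snippet quoted earlier, and the output format is the syntax proposed in this thread rather than a confirmed NMDC rule.

```python
def compose_samp_taxon_id(gold_record: dict) -> str:
    """Build 'name [NCBITaxon:id]' from a GOLD biosample record.

    Hypothetical helper: the field names come from the GOLD API snippet,
    and the output format is the syntax proposed in this discussion.
    """
    name = gold_record["ncbiTaxName"]
    tax_id = int(gold_record["ncbiTaxId"])  # GOLD appears to return an integer
    return f"{name} [NCBITaxon:{tax_id}]"


# Trimmed copy of the GOLD API record shown in the thread.
record = {
    "biosampleGoldId": "Gb0291653",
    "ncbiTaxId": 410658,
    "ncbiTaxName": "soil metagenome",
}
print(compose_samp_taxon_id(record))
```

Casting ncbiTaxId through int() is a defensive choice in case another source returns the ID as a string.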
- samp_taxon_id
Agree, samp_taxon_id is the right slot. This is NOT collected by the user or in the submission portal, but I think that’s fine. This can just be a GOLD assigned field
- specific_host: https://genomicsstandardsconsortium.github.io/mixs/0000029/
- host_taxid:
These descriptions are more or less forward and reverse, so how they're different is unclear, but I think this is a different issue.
Decision during Wednesday 1pm metadata meeting:
- Format change: ncbiTaxName to NCBITaxon:####
- Capture from GOLD: "ncbiTaxName [NCBITaxon:ncbiTaxID]", for example gut metagenome [NCBITaxon:749906] or soil metagenome [NCBITaxon:######]
Thanks for the notes, @mslarae13
A couple of questions:
- samp_taxon_id in the Submission Portal. How would that relate to our vision of gathering sample metadata in bulk from external sources (like GOLD) and then loading it into the submission portal for a "data wrangler" to check?
- samp_taxon_id slots for Biosamples by composing the ncbiTaxName and ncbiTaxId. For the taxon identifier portion, MIxS implies a syntax of ^NCBI:txid[0-9]+$ like 'NCBI:txid749906', but NMDC will instead follow the syntax used by the OBO Foundry, ^NCBITaxon:[0-9]+$ like 'NCBITaxon:749906'.
See regexr.com for experimenting with those patterns.
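The two patterns can also be tried out in Python. This is only a sketch: the regular expressions are copied from the comment above, while the constant names and the to_obo_style helper are hypothetical and not part of nmdc-schema.

```python
import re

# Patterns quoted in the discussion.
MIXS_STYLE = re.compile(r"^NCBI:txid[0-9]+$")   # e.g. NCBI:txid749906
OBO_STYLE = re.compile(r"^NCBITaxon:[0-9]+$")   # e.g. NCBITaxon:749906


def to_obo_style(identifier: str) -> str:
    """Rewrite a MIxS-style taxon identifier in the OBO Foundry style.

    Hypothetical helper; identifiers already in OBO style pass through.
    """
    if OBO_STYLE.fullmatch(identifier):
        return identifier
    if MIXS_STYLE.fullmatch(identifier):
        return "NCBITaxon:" + identifier[len("NCBI:txid"):]
    raise ValueError(f"unrecognized taxon identifier: {identifier!r}")


print(to_obo_style("NCBI:txid749906"))
```

fullmatch is used instead of match so that a trailing newline after the digits is rejected rather than silently accepted by the $ anchor.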
Add a slot called ncbi_taxon_id to capture the host taxonomy ID as present in GOLD, and possibly in other sources.
For example, let's look at Gb0291745 on the GOLD website: https://gold.jgi.doe.gov/biosample?id=Gb0291745
Look under the Host Metadata section, and you'll see a value of 3689 associated with the host taxonomy ID.
We need a slot under the Biosample class to capture this information. In this issue, I'm proposing the addition of a slot called ncbi_taxon_id generally, and asserting its usage in the Biosample class using its slots property.