Closed wdduncan closed 3 years ago
Thanks for opening this, @wdduncan. We've isolated the issue to the jgi.run_etl solid, which encapsulates the translation of the JGI GOLD export to the NMDC schema format. Let me know if you have a suggestion for a better name for this. I'm thinking of renaming the jgi pipeline to gold_translation to better reflect its scope, so e.g. gold_translation.run_etl would be the solid (assuming the job is still black-boxed to run a make target).
gold_translation.run_etl seems reasonable to me.
👍 done (with the renaming)
Updating ticket with tasks discussed on infrastructure/kitware call.
The reason for the missing biosample records was that the env_broad_scale values were missing for these records. Since this is a required field in the schema, the records were being filtered out.
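A minimal sketch of how a required-field filter can drop records silently, and how to report the dropped ids instead. The record layout, the `REQUIRED_FIELDS` tuple, the function name, and the sample ids are illustrative assumptions, not the actual ETL code.

```python
# Hypothetical sketch: names and record layout are assumptions, not the NMDC ETL.
REQUIRED_FIELDS = ("env_broad_scale",)  # assumed subset of required schema slots


def split_by_required(records, required=REQUIRED_FIELDS):
    """Return (kept, dropped), where dropped maps record id -> missing fields."""
    kept, dropped = [], {}
    for rec in records:
        # Treat absent keys, None, and empty strings as missing.
        missing = [f for f in required if rec.get(f) in (None, "")]
        if missing:
            dropped[rec.get("id")] = missing
        else:
            kept.append(rec)
    return kept, dropped


# Placeholder ids, not real GOLD biosamples.
records = [
    {"id": "Gb0000001", "env_broad_scale": "ENVO_00000446"},
    {"id": "Gb0000002", "env_broad_scale": None},
    {"id": "Gb0000003"},
]
kept, dropped = split_by_required(records)
```

Returning the dropped ids alongside the kept records makes it possible to log exactly which biosamples fell out and why, rather than only noticing a lower record count downstream.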
After reviewing the missing records, we decided to fill in the missing env_broad_scale values in the GOLD ETL dump with the value ENVO_00000446 (terrestrial biome), and to set the longitude for biosample Gb0291638 to -123.194.
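The two fixes could be applied to the dump before re-running the ETL roughly as follows. Only the values ENVO_00000446, Gb0291638 and -123.194 come from this ticket; the field names `gold_id` and `longitude`, the list-of-dicts layout, and the sample input values are assumptions for illustration.

```python
# Hypothetical sketch: only the three constant values below come from this ticket.
DEFAULT_BROAD_SCALE = "ENVO_00000446"  # terrestrial biome
LONGITUDE_FIXES = {"Gb0291638": -123.194}


def patch_records(records):
    """Return patched copies; the input records are left untouched."""
    patched = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the original dump stays unchanged
        if not rec.get("env_broad_scale"):
            rec["env_broad_scale"] = DEFAULT_BROAD_SCALE
        if rec.get("gold_id") in LONGITUDE_FIXES:
            rec["longitude"] = LONGITUDE_FIXES[rec["gold_id"]]
        patched.append(rec)
    return patched


# Made-up input values; the original longitude is not stated in the ticket.
sample = [
    {"gold_id": "Gb0291638", "env_broad_scale": None, "longitude": 0.0},
    {"gold_id": "Gb0000001", "env_broad_scale": "ENVO_XXXXXXXX", "longitude": 10.0},
]
result = patch_records(sample)
```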
After these updates are completed, re-run the ETL and pass to @dwinston
cc @dehays @emileyfadrosh @pvangay @cmungall
@pvangay requests that the GOLD path for biosample Gb0119280 be updated to:
ecosystem: Host-associated
ecosystem_category: Mammals
ecosystem_type: Digestive system
ecosystem_subtype: Large intestine
specific_ecosystem: Fecal
Proposed values for the mixs triad are:
env_broad_scale: terrestrial biome
env_local_scale: large intestine (a term from UBERON)
env_medium: fecal material
@jagadishcs does this sound okay to you?
@wdduncan @pvangay @TBKReddy
The BioSample https://gold.jgi.doe.gov/biosample?id=Gb0119280 has Sample Collection Site "Enriched soil from Old Woman Creek wetland" and Habitat "Enriched soil". If there is any change in the biosample or material studied, multiple fields in GOLD have to be updated, including the biosample name, project name, GOLD ecosystem classification path, and EnvO triad, and NCBI has to be informed of the same changes, since this BioSample was submitted by JGI. We would also need to update all of these fields in the NMDC. So please confirm this change with all available information for this biosample.
In your translation of the GOLD ecosystem classification path to the EnvO triad, I agree with the env_local_scale and env_medium terms but not with the env_broad_scale term, terrestrial biome, for mammals (by the way, I did not take aquatic mammals into account in disagreeing with you).
We need to discuss and reach a consensus on which env_broad_scale terms would be better for host-associated (human/animal) biosamples; a decision on this now will save us a lot of time in the future.
@jagadishcs this is confirmed to be a moose gut sample according to Reb Daly, who ran this study for Kelly Wrighton. We talked about this at length and I just confirmed again with her now. Please update it accordingly. As for the ENVO and GOLD terms, if the initial assignments I came up with for a "moose gut environment" are incorrect, then please update it to what is most appropriate as that area is not my expertise. Thanks!
From Reb today: "This is an enrichment from moose rumen (moose gut). This is one of those instances where we wanted to see if we saw the same gene patterns/diversity in cellulose degradation between the fracking wells and moose gut with high cellulose. Definitely not soil."
fyi @emileyfadrosh
@jagadishcs thanks for looping me in on the metadata update for https://gold.jgi.doe.gov/biosample?id=Gb0119280 from a soil sample to a host-associated fecal sample, and @pvangay for providing additional information from the PI.
@jagadishcs go ahead and create a GOLD ticket for updating this biosample/project and notifying NCBI about these updates.
For ENVO broad scale, leave it blank in GOLD. We will not be using the terrestrial biome term for any of these host-associated samples, as it would conflict with the rest of the curated metadata, including the habitat, and be confusing. It can remain blank until there is a suitable term to use in the future.
Closing this ticket as no further work on the NMDC side is needed. Thanks @pvangay for confirming the metadata updates needed for GOLD.
@TBKReddy @jagadishcs I followed up again with Reb to make sure we understand exactly how the "moose rumen" was sampled -- and she said that they implanted a port into the moose's rumen to directly sample from there. So my apologies for assuming it was fecal material. Please update that accordingly -- could you let us know what your final recommendations for the GOLD and ENVO terms are based on this information? Thank you.
@pvangay, thanks for this additional information. I updated the GOLD ticket where the needed updates are being tracked. @jagadishcs will get to this tomorrow after his return and will keep you posted or point you to the updated values. Meanwhile, I wanted to check whether the moose was fed any special diet, whether anything was added to the rumen contents, or whether it was on its natural/normal diet. If you can check this with the PI, that would be great. Thank you.
@TBKReddy: No special diet - just natural vegetation in Alaska.
FYI, they did reference this paper, which has much more detail about moose sampling and procedures. Note that the paper references a control diet - but Reb confirmed that the sample came from a moose who ate natural vegetation. She doesn't have additional information about the specific sample (time points, specific location, etc.) that was included in this sequencing project. Hope this is helpful.
Please let me know if you need anything else. I think if you have all that you need - we can close this ticket.
@pvangay thanks for the additional information. Yes, you can go ahead and close this ticket.
@jagadishcs, please note the geographic location for this sample (Gb0119280). It is now Alaska, not Ohio.
Thank you @TBKReddy !!
Adding links to related issues even though it's closed now https://github.com/microbiomedata/nmdc-metadata/issues/355 https://github.com/microbiomedata/nmdc-metadata/issues/356
Hi @wdduncan, @TBKReddy. Reopening this issue to let you know the updated metadata in GOLD for BioSample Gb0119280, so that you can update the NMDC biosample entry accordingly:
New BioSample Name: Rumen-fistulated moose microbial communities from Alaska, USA - LMS_cellobiose_enrichment
New Habitat: Rumen-fistulated moose
New sample collection site: Rumen fluid from live moose (Rumen-fistulated moose)
The GOLD Ecosystem path has been updated to:
Ecosystem - Host-associated
Ecosystem Category - Mammals
Ecosystem Type - Digestive system
Ecosystem Subtype - Stomach
Specific Ecosystem - Rumen
New Geographic Location - USA: Matanuska Research Center, Alaska
New Latitude: 61.566367
New Longitude: -149.2538247
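A quick range check can catch coordinate mix-ups when values like these are copied into the NMDC entry: latitude must lie in [-90, 90] and longitude in [-180, 180]. A minimal sketch, with a hypothetical function name:

```python
def check_lat_lon(latitude, longitude):
    """Return a list of problems; an empty list means both values are in range."""
    problems = []
    if not -90.0 <= latitude <= 90.0:
        problems.append(f"latitude {latitude} outside [-90, 90]")
    if not -180.0 <= longitude <= 180.0:
        problems.append(f"longitude {longitude} outside [-180, 180]")
    return problems


ok = check_lat_lon(61.566367, -149.2538247)
swapped = check_lat_lon(-149.2538247, 61.566367)
```

A value of -149.2538247 can only be a longitude, since it falls outside the valid latitude range, so the swapped call reports one problem while the first call reports none.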
Since the BioSample studied was rumen fluid, the following terms are suggested: env_local_scale: digestive system (UBERON_0001007); env_material: biological fluid (SIO_010051), or we can suggest that EnvO create a new term, biological fluid material.
env_local_scale: digestive system (UBERON_0001007)
UBERON:0007365 ! rumen
env_material: biological fluid (SIO_010051)
UBERON:0006314 ! biological fluid
But note that we have not yet discussed host-associated samples; I didn't realize we were doing these. There are specific fields for the host-associated package that are better homes for this level of specificity. From the mixs schema:
host_body_habitat:
  is_a: environment field
  aliases:
    - host body habitat
  description: Original body habitat where the sample was obtained from
  range: string
  examples:
    - value: nasopharynx
  comments:
    - 'Expected value: free text'
    - 'Occurrence: 1'
    - 'Position: 14.0'
    - 'This field is used uniquely in: host-associated'
  pattern: '{text}'
  slot_uri: MIXS:0000866
host_body_site:
  is_a: environment field
  aliases:
    - host body site
  description: Name of body site where the sample was obtained from, such as a specific
    organ or tissue (tongue, lung etc...). For foundational model of anatomy ontology
    (fma) (v 4.11.0) or Uber-anatomy ontology (UBERON) (v releases/2014-06-15) terms,
    please see http://purl.bioontology.org/ontology/FMA or http://purl.bioontology.org/ontology/UBERON
  range: string
  examples:
    - value: gill [UBERON:0002535]
  comments:
    - 'Expected value: FMA or UBERON'
    - 'Occurrence: 1'
    - 'Position: 15.0'
    - 'This field is used in: 6 packages: host-associated, human-associated, human-gut,
      human-oral, human-skin, human-vaginal'
  pattern: '{termLabel} {[termID]}'
  slot_uri: MIXS:0000867
host_body_product:
  is_a: environment field
  aliases:
    - host body product
  description: Substance produced by the body, e.g. Stool, mucus, where the sample
    was obtained from. For foundational model of anatomy ontology (fma) or Uber-anatomy
    ontology (UBERON) terms, please see https://www.ebi.ac.uk/ols/ontologies/fma
    or https://www.ebi.ac.uk/ols/ontologies/uberon
  range: string
  examples:
    - value: Portion of mucus [fma66938]
  comments:
    - 'Expected value: FMA or UBERON'
    - 'Occurrence: 1'
    - 'Position: 16.0'
    - 'This field is used in: 6 packages: host-associated, human-associated, human-gut,
      human-oral, human-skin, human-vaginal'
  pattern: '{termLabel} {[termID]}'
  slot_uri: MIXS:0000888
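For reference, values that follow the '{termLabel} {[termID]}' pattern in these slots could be split into label and identifier along these lines. This is a sketch under one assumed reading of the pattern (free-text label, whitespace, then an identifier in square brackets); the regex and function name are illustrative and not part of the mixs schema.

```python
import re

# Assumed reading of '{termLabel} {[termID]}': a free-text label, whitespace,
# then an identifier with no spaces enclosed in square brackets.
TERM_PATTERN = re.compile(r"^(?P<label>.+?)\s+\[(?P<id>[^\[\]\s]+)\]$")


def parse_term(value):
    """Return (label, term_id), or None when the value does not match."""
    match = TERM_PATTERN.match(value.strip())
    if match is None:
        return None
    return match.group("label"), match.group("id")


# The first two inputs are the example values quoted in the schema excerpt.
gill = parse_term("gill [UBERON:0002535]")
mucus = parse_term("Portion of mucus [fma66938]")
bare = parse_term("nasopharynx")  # free text, as allowed for host_body_habitat
```

The identifier part is deliberately permissive, because the schema's own examples mix a CURIE-style id (UBERON:0002535) with one that has no colon (fma66938).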
Checked with @wdduncan and @emileyfadrosh and closing this issue. The moose sample will be removed from the portal.
The ETL process on the NMDC_DUMP_Jun_21_2021 GOLD data dump failed to translate all of the biosamples. The GOLD ids of 108 such failures are listed below. cc @dwinston