microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

missing 108 samples from the NMDC_DUMP_Jun_21_2021 GOLD data dump #11

Closed wdduncan closed 3 years ago

wdduncan commented 3 years ago

The ETL process on the NMDC_DUMP_Jun_21_2021 GOLD data dump failed to translate all the biosamples. The GOLD ids of the 108 biosamples that failed are listed below.

cc @dwinston

1 gold:Gb0291799
2 gold:Gb0291728
3 gold:Gb0291794
4 gold:Gb0291771
5 gold:Gb0291740
6 gold:Gb0291713
7 gold:Gb0291768
8 gold:Gb0291757
9 gold:Gb0291795
10 gold:Gb0291766
11 gold:Gb0291797
12 gold:Gb0291716
13 gold:Gb0291739
14 gold:Gb0291699
15 gold:Gb0291790
16 gold:Gb0291787
17 gold:Gb0291769
18 gold:Gb0291732
19 gold:Gb0291756
20 gold:Gb0291733
21 gold:Gb0291791
22 gold:Gb0291726
23 gold:Gb0291693
24 gold:Gb0291785
25 gold:Gb0291719
26 gold:Gb0291717
27 gold:Gb0291746
28 gold:Gb0291765
29 gold:Gb0291738
30 gold:Gb0291777
31 gold:Gb0291714
32 gold:Gb0291712
33 gold:Gb0291751
34 gold:Gb0291792
35 gold:Gb0291744
36 gold:Gb0291718
37 gold:Gb0291758
38 gold:Gb0291727
39 gold:Gb0291783
40 gold:Gb0291708
41 gold:Gb0291711
42 gold:Gb0291722
43 gold:Gb0291775
44 gold:Gb0291700
45 gold:Gb0291779
46 gold:Gb0291748
47 gold:Gb0291752
48 gold:Gb0291761
49 gold:Gb0291764
50 gold:Gb0291729
51 gold:Gb0291720
52 gold:Gb0291696
53 gold:Gb0291702
54 gold:Gb0291709
55 gold:Gb0291698
56 gold:Gb0291701
57 gold:Gb0291710
58 gold:Gb0291697
59 gold:Gb0291776
60 gold:Gb0291737
61 gold:Gb0291734
62 gold:Gb0291721
63 gold:Gb0291731
64 gold:Gb0291706
65 gold:Gb0291793
66 gold:Gb0291692
67 gold:Gb0291784
68 gold:Gb0291789
69 gold:Gb0291778
70 gold:Gb0291767
71 gold:Gb0291747
72 gold:Gb0291694
73 gold:Gb0291798
74 gold:Gb0291695
75 gold:Gb0291741
76 gold:Gb0291770
77 gold:Gb0291782
78 gold:Gb0291742
79 gold:Gb0291735
80 gold:Gb0291715
81 gold:Gb0291760
82 gold:Gb0291763
83 gold:Gb0291780
84 gold:Gb0291703
85 gold:Gb0291781
86 gold:Gb0291707
87 gold:Gb0291753
88 gold:Gb0291749
89 gold:Gb0291704
90 gold:Gb0291755
91 gold:Gb0291796
92 gold:Gb0291736
93 gold:Gb0291750
94 gold:Gb0291754
95 gold:Gb0291743
96 gold:Gb0291705
97 gold:Gb0291773
98 gold:Gb0291723
99 gold:Gb0291730
100 gold:Gb0291788
101 gold:Gb0291725
102 gold:Gb0291724
103 gold:Gb0291786
104 gold:Gb0291745
105 gold:Gb0291774
106 gold:Gb0291772
107 gold:Gb0291759
108 gold:Gb0291762

dwinston commented 3 years ago

Thanks for opening this, @wdduncan. We've isolated the issue to the jgi.run_etl solid, which encapsulates the translation of the JGI GOLD export to the NMDC schema format.

Let me know if you have a suggestion for a better name for this. I'm thinking to rename the jgi pipeline to gold_translation to better reflect its scope, so e.g. gold_translation.run_etl would be the solid (assuming the job is still black-boxed to run a make target).

wdduncan commented 3 years ago

gold_translation.run_etl seems reasonable to me.

dwinston commented 3 years ago

👍 done (with the renaming)

wdduncan commented 3 years ago

Updating ticket with tasks discussed on infrastructure/kitware call.

The reason for the missing biosample records was that the env_broad_scale values were missing for these records. Since this is a required field in the schema, the records were being filtered out.
After reviewing the missing records, we decided:

After these updates are completed, re-run the ETL and pass to @dwinston
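The filtering behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual ETL code: the function name `split_by_required_fields`, the `REQUIRED_FIELDS` tuple, the `id` key, and the sample GOLD ids are all hypothetical, and the only required field assumed here is `env_broad_scale`. Reporting the dropped records alongside the kept ones makes this kind of silent loss easier to spot.

```python
# Assumption: only the field named in this thread is treated as required.
REQUIRED_FIELDS = ("env_broad_scale",)


def split_by_required_fields(records, required=REQUIRED_FIELDS):
    """Return (kept, dropped); dropped maps a record id to its missing fields."""
    kept, dropped = [], {}
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [field for field in required if not record.get(field)]
        if missing:
            dropped[record.get("id", "<no id>")] = missing
        else:
            kept.append(record)
    return kept, dropped


# Hypothetical records, not taken from the GOLD dump.
biosamples = [
    {"id": "gold:Gb0000001", "env_broad_scale": "terrestrial biome"},
    {"id": "gold:Gb0000002", "env_broad_scale": None},
    {"id": "gold:Gb0000003"},
]

kept, dropped = split_by_required_fields(biosamples)
for gold_id, missing in dropped.items():
    print(f"{gold_id}: missing {', '.join(missing)}")
```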

cc @dehays @emileyfadrosh @pvangay @cmungall

wdduncan commented 3 years ago

@pvangay requests that biosample Gb0119280 GOLD path be updated to:

ecosystem: Host-associated 
ecosystem_category: Mammals 
ecosystem_type: Digestive system
ecosystem_subtype: Large intestine 
specific_ecosystem: Fecal 

Proposed values for the mixs triad are:

env_broad_scale: terrestrial biome
env_local_scale: large intestine (a term from UBERON)
env_medium: fecal material

@jagadishcs does this sound okay to you?

jagadishcs commented 3 years ago

@wdduncan @pvangay @TBKReddy

pvangay commented 3 years ago

@jagadishcs this is confirmed to be a moose gut sample according to Reb Daly, who ran this study for Kelly Wrighton. We talked about this at length and I just confirmed again with her now. Please update it accordingly. As for the ENVO and GOLD terms, if the initial assignments I came up with for a "moose gut environment" are incorrect, then please update it to what is most appropriate as that area is not my expertise. Thanks!

From Reb today: "This is an enrichment from moose rumen (moose gut). This is one of those instances where we wanted to see if we saw the same gene patterns/diversity in cellulose degradation between the fracking wells and moose gut with high cellulose. Definitely not soil."

fyi @emileyfadrosh

TBKReddy commented 3 years ago

@jagadishcs thanks for looping me in on the metadata update for https://gold.jgi.doe.gov/biosample?id=Gb0119280 from a soil sample to a host-associated fecal sample, and @pvangay thanks for providing additional information from the PI.

@jagadishcs go ahead and create a GOLD ticket for updating this biosample/project and notifying NCBI about these updates.

For ENVO broad scale, leave it blank in GOLD. We will not be using the terrestrial biome term for any of these host-associated samples, as it would conflict with the rest of the curated metadata, including the habitat, and would be confusing. It can remain blank until there is a suitable term to use in the future.

emileyfadrosh commented 3 years ago

Closing this ticket as no further work on the NMDC side is needed. Thanks @pvangay for confirming the metadata updates needed for GOLD.

pvangay commented 3 years ago

@TBKReddy @jagadishcs I followed up again with Reb to make sure we understand exactly how the "moose rumen" was sampled -- and she said that they implanted a port into the moose's rumen to directly sample from there. So my apologies for assuming it was fecal material. Please update that accordingly -- could you let us know what your final recommendations for the GOLD and ENVO terms are based on this information? Thank you.

TBKReddy commented 3 years ago

@pvangay, thanks for this additional information. I updated the GOLD ticket where the needed updates are being tracked. @jagadishcs will get to this tomorrow after his return and will keep you posted or point you to the updated values. Meanwhile, I wanted to check whether the moose was fed any special diet, whether anything was added to the rumen contents, or whether it was on its natural/normal diet. If you can check this with the PI, that would be great. Thank you.

pvangay commented 3 years ago

@TBKReddy: No special diet - just natural vegetation in Alaska.

FYI, they did reference this paper, which has much more detail about moose sampling and procedures. Note that the paper references a control diet - but Reb confirmed that the sample came from a moose who ate natural vegetation. She doesn't have additional information about the specific sample (time points, specific location, etc.) that was included in this sequencing project. Hope this is helpful.

Please let me know if you need anything else. I think if you have all that you need - we can close this ticket.

TBKReddy commented 3 years ago

@pvangay thanks for the additional information. Yes, you can go ahead and close this ticket.

@jagadishcs , please note the geographic location for this sample (Gb0119280). Now it is Alaska and not Ohio.

pvangay commented 3 years ago

Thank you @TBKReddy !!

ssarrafan commented 3 years ago

Adding links to related issues even though it's closed now https://github.com/microbiomedata/nmdc-metadata/issues/355 https://github.com/microbiomedata/nmdc-metadata/issues/356

jagadishcs commented 3 years ago

Hi @wdduncan, @TBKReddy. Reopening this issue to let you know about the updated metadata in GOLD for a BioSample (Gb0119280), so that you can update the NMDC biosample entry accordingly:

New BioSample Name: Rumen-fistulated moose microbial communities from Alaska, USA - LMS_cellobiose_enrichment

New Habitat: Rumen-fistulated moose
New sample collection site: Rumen fluid from live moose (Rumen-fistulated moose)

The GOLD Ecosystem path has been updated to:

Ecosystem - Host-associated
Ecosystem Category - Mammals
Ecosystem Type - Digestive system
Ecosystem Subtype - Stomach
Specific Ecosystem - Rumen

New Geographic Location - USA: Matanuska Research Center, Alaska
New Latitude: 61.566367
New Longitude: -149.2538247

Since the BioSample studied was rumen fluid, the following terms are suggested:

env_local_scale: digestive system (UBERON_0001007)
env_material: biological fluid (SIO_010051)

Alternatively, we can suggest that EnvO create a new term, "biological fluid material".

cmungall commented 3 years ago

env_local_scale: digestive system (UBERON_0001007)

UBERON:0007365 ! rumen

env_material: biological fluid (SIO_010051)

UBERON:0006314 ! biological fluid

But note that we have not yet discussed host-associated samples; I didn't realize we were doing these. There are specific fields for the host-associated package that are better homes for this level of specificity. From the mixs schema:

  host_body_habitat:
    is_a: environment field
    aliases:
    - host body habitat
    description: Original body habitat where the sample was obtained from
    range: string
    examples:
    - value: nasopharynx
    comments:
    - 'Expected value: free text'
    - 'Occurrence: 1'
    - 'Position: 14.0'
    - 'This field is used uniquely in: host-associated'
    pattern: '{text}'
    slot_uri: MIXS:0000866
  host_body_site:
    is_a: environment field
    aliases:
    - host body site
    description: Name of body site where the sample was obtained from, such as a specific
      organ or tissue (tongue, lung etc...). For foundational model of anatomy ontology
      (fma) (v 4.11.0) or Uber-anatomy ontology (UBERON) (v releases/2014-06-15) terms,
      please see http://purl.bioontology.org/ontology/FMA or http://purl.bioontology.org/ontology/UBERON
    range: string
    examples:
    - value: gill [UBERON:0002535]
    comments:
    - 'Expected value: FMA or UBERON'
    - 'Occurrence: 1'
    - 'Position: 15.0'
    - 'This field is used in: 6 packages: host-associated, human-associated, human-gut,
      human-oral, human-skin, human-vaginal'
    pattern: '{termLabel} {[termID]}'
    slot_uri: MIXS:0000867
  host_body_product:
    is_a: environment field
    aliases:
    - host body product
    description: Substance produced by the body, e.g. Stool, mucus, where the sample
      was obtained from. For foundational model of anatomy ontology (fma) or Uber-anatomy
      ontology (UBERON) terms, please see https://www.ebi.ac.uk/ols/ontologies/fma
      or https://www.ebi.ac.uk/ols/ontologies/uberon
    range: string
    examples:
    - value: Portion of mucus [fma66938]
    comments:
    - 'Expected value: FMA or UBERON'
    - 'Occurrence: 1'
    - 'Position: 16.0'
    - 'This field is used in: 6 packages: host-associated, human-associated, human-gut,
      human-oral, human-skin, human-vaginal'
    pattern: '{termLabel} {[termID]}'
    slot_uri: MIXS:0000888
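
As a rough illustration of the `'{termLabel} {[termID]}'` pattern used by `host_body_site` and `host_body_product` above, the sketch below splits such a value into its label and term id. The regular expression is one interpretation of that pattern string, not an official MIxS validator, and the names `parse_term` and `normalize_curie` are hypothetical. The second helper rewrites underscore-style ids such as `UBERON_0001007` into the colon form used elsewhere in this thread.

```python
import re

# Assumption: a term id is a letter prefix, an optional ':' or '_', then digits.
TERM_PATTERN = re.compile(
    r"^(?P<label>.+?)\s*\[(?P<term_id>[A-Za-z]+[:_]?\d+)\]$"
)


def parse_term(value):
    """Return (label, term_id) for 'label [ID]' values, or None if no match."""
    match = TERM_PATTERN.match(value.strip())
    if match is None:
        return None
    return match.group("label"), match.group("term_id")


def normalize_curie(term_id):
    """Rewrite 'PREFIX_123' as 'PREFIX:123'; leave other forms unchanged."""
    prefix, sep, local = term_id.partition("_")
    if sep and ":" not in term_id:
        return f"{prefix}:{local}"
    return term_id


print(parse_term("gill [UBERON:0002535]"))
print(parse_term("Portion of mucus [fma66938]"))
print(parse_term("nasopharynx"))  # free text, so there is no term id to extract
print(normalize_curie("UBERON_0001007"))
```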

ssarrafan commented 3 years ago

Checked with @wdduncan and @emileyfadrosh and closing this issue. The moose sample will be removed from the portal.