microbiomedata / sample-annotator

NMDC Sample Annotator
https://microbiomedata.github.io/sample-annotator/static/intro.html

Determine if and how `rel_to_oxygen` will be used in the submission schema #58

Closed turbomam closed 10 months ago

turbomam commented 2 years ago

This illustrates approaches for repairing columns whose values are restricted to an enumeration of permissible values, also known as a controlled vocabulary. Look for 'enumeration' in the MIxS 'Expected value' column, or a range of '*_enum' in the LinkML model. See the reference material below.

Related code: sample_annotator/rel_to_oxygen_example.py

Permissible values

Reference material

Observed, with matches

| rel_to_oxygen | r2o_count | lc_trimmed_r2o | match | interpretation |
| --- | --- | --- | --- | --- |
| None | 44666 | none | | |
| aerobe | 3940 | aerobe | aerobe | |
| obligate anaerobe | 66 | obligate anaerobe | obligate anaerobe | |
| oxic | 59 | oxic | | |
| anaerobe | 29 | anaerobe | anaerobe | |
| facultative anaerobes | 21 | facultative anaerobes | | facultative |
| Aerobic | 20 | aerobic | aerobe | |
| aerobic | 18 | aerobic | aerobe | |
| anaerobic | 18 | anaerobic | anaerobe | |
| Oxic | 13 | oxic | | |
| microaerophilic | 11 | microaerophilic | microaerophilic | |
| hypoxic | 6 | hypoxic | | |
| normal oxic seawater | 4 | normal oxic seawater | | |
| oxic/anoxic boundary | 4 | oxic/anoxic boundary | | |
| 22 mg/l | 3 | 22 mg/l | | oxic |
| 6.0-6.5 mg/l | 3 | 6.0-6.5 mg/l | | |
| Hypoxic | 3 | hypoxic | | |
| 0 mg/l | 1 | 0 mg/l | | anoxic |
| 1.0-2.2 mg/l | 1 | 1.0-2.2 mg/l | | |
| 23.5 mg/l | 1 | 23.5 mg/l | | oxic |
| 25 mg/l | 1 | 25 mg/l | | oxic |
| aerobic-anaerobic | 1 | aerobic-anaerobic | | |
| facultative | 1 | facultative | facultative | |
| facultative anaerobe | 1 | facultative anaerobe | | facultative |
| obligate | 1 | obligate | | |
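
The lowercase-and-trim step and the `match` column in the table can be sketched roughly as follows. This is only an illustration, not the code in sample_annotator/rel_to_oxygen_example.py: the function name `repair_rel_to_oxygen` and the `SYNONYMS` mapping are assumptions, with the two synonyms taken from the rows where `aerobic` and `anaerobic` were matched to `aerobe` and `anaerobe`. Values that only have an entry in the `interpretation` column are deliberately left unmatched.

```python
# Permissible values of rel_to_oxygen as listed by MIxS.
PERMISSIBLE = {
    "aerobe",
    "anaerobe",
    "facultative",
    "microaerophilic",
    "microanaerobe",
    "obligate aerobe",
    "obligate anaerobe",
}

# Hypothetical synonym map, limited to the matches shown in the table.
SYNONYMS = {"aerobic": "aerobe", "anaerobic": "anaerobe"}


def repair_rel_to_oxygen(raw):
    """Return a permissible value for `raw`, or None when no match is found."""
    if raw is None:
        return None
    lc_trimmed = raw.strip().lower()
    if lc_trimmed in PERMISSIBLE:
        return lc_trimmed
    # Fall back to the synonym map; unknown values stay unmatched.
    return SYNONYMS.get(lc_trimmed)


for value in ["Aerobic", " obligate anaerobe ", "oxic", "22 mg/l"]:
    print(repr(value), "->", repair_rel_to_oxygen(value))
```

Returning None for anything unmatched keeps the low-count and ambiguous values visible for manual review instead of silently coercing them.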

Easy fixes:

Trickier!

Probably not justified when the count is really low, like 1

Gotchas:

mslarae13 commented 1 year ago

Can we run a query that asks "of the samples captured in NMDC (mongoDB), do any of the Biosample objects have this slot (or oxy_stat_samp) filled out? If so, what is there?"

mslarae13 commented 1 year ago

Once @turbomam has made a query from NMDC mongoDB, reassign to Montana to check

mslarae13 commented 1 year ago

Only keep rel_to_oxygen. Note in rel_to_oxygen that this is applicable to "Column: oxygenation status of sample".

turbomam commented 1 year ago
db.getCollection("biosample_set").find( { part_of : { $exists : true } } );

2449

db.getCollection("biosample_set").find( { rel_to_oxygen : { $exists : true } } );

0

db.getCollection("biosample_set").find( { oxy_stat_samp : { $exists : true } } );

0

turbomam commented 1 year ago

Neither rel_to_oxygen nor oxy_stat_samp has been provided for any biosample in the production MongoDB as of this date.
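
For anyone who wants to repeat this kind of check without database access, here is a minimal sketch that mirrors the `$exists` filter over Biosample documents exported as Python dicts. The helper `count_with_slot` and the sample documents are hypothetical and exist only for illustration; they are not real NMDC data.

```python
def count_with_slot(docs, slot):
    """Count documents that contain `slot`, like {slot: {$exists: true}}."""
    # As with $exists, a key that is present with a null value still counts.
    return sum(1 for doc in docs if slot in doc)


# Made-up documents for illustration only.
docs = [
    {"id": "example:1", "part_of": ["example:study1"]},
    {"id": "example:2", "part_of": ["example:study1"], "depth": 1.5},
]

counts = {
    slot: count_with_slot(docs, slot)
    for slot in ("part_of", "rel_to_oxygen", "oxy_stat_samp")
}
print(counts)
```

Against a live database, pymongo's `count_documents` with a filter such as `{"oxy_stat_samp": {"$exists": True}}` should return the same kind of count directly.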

turbomam commented 1 year ago

| Structured comment name | Item (rdfs:label) | Definition | Expected value | Value syntax | Example | Section | migs_eu | migs_ba | migs_pl | migs_vi | migs_org | mims | mimarks_s | mimarks_c | misag | mimag | miuvig | Preferred unit | Occurence | MIXS ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rel_to_oxygen | relationship to oxygen | Is this organism an aerobe, anaerobe? Please note that aerobic and anaerobic are valid descriptors for microbial environments | enumeration | [aerobe\|anaerobe\|facultative\|microaerophilic\|microanaerobe\|obligate aerobe\|obligate anaerobe] | aerobe | nucleic acid sequence source | - | C | - | - | - | X | X | C | X | X | - | | 1 | MIXS:0000015 |

| Environmental package | Structured comment name | Package item | Definition | Expected value | Value syntax | Example | Requirement | Preferred unit | Occurrence | MIXS ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| agriculture | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic] | | C | | 1 | MIXS:0000753 |
| air | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| host-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-gut | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-oral | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-skin | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-vaginal | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| hydrocarbon resources-cores | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| hydrocarbon resources-fluids/swabs | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| microbial mat/biofilm | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| miscellaneous natural or artificial environment | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| plant-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| sediment | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| symbiont-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample. | enumeration | [aerobic\|anaerobic] | aerobic | X | | 1 | MIXS:0000753 |
| wastewater/sludge | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| water | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |

turbomam commented 1 year ago

@mslarae13 I agree that we should only use one of rel_to_oxygen or oxy_stat_samp for NMDC biosamples. I would be more inclined to use oxy_stat_samp since it is supposed to be about samples and rel_to_oxygen is supposed to be about organisms (for checklists like MIGS-ba?)

After deciding that, we should combine the values from the rel_to_oxygen enumeration, the oxy_stat_samp enumeration and the values found in NCBI's biosample_set into one reasonable NMDC enumeration.

turbomam commented 1 year ago

Here are the oxy_stat_samp values in BBOP's relational version of NCBI's biosample_set

select
    value,
    count(1)
from
    all_attribs aa
where
    aa.harmonized_name = 'oxy_stat_samp'
group by
    value
order by
    count(1) desc ;
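
| value | count |
| --- | --- |
| aerobic | 4420 |
| NA | 4332 |
| anaerobic | 3890 |
| not collected | 2730 |
| not applicable | 2165 |
| missing | 1757 |
| anaerobe | 535 |
| NOT APPLICABLE | 418 |
| 0,00 | 144 |
| none | 123 |
| N/A | 39 |
| aerobe | 36 |
| Unknown | 26 |
| not collecte | 24 |
| Not collected | 17 |
| Not available | 16 |
| 5,03 | 6 |
| 7,37 | 6 |
| 14,75 | 6 |
| 10,45 | 6 |
| 4,89 | 6 |
| 7,07 | 6 |
| unknown | 6 |
| 7,50 | 6 |
| 4,84 | 6 |
| 10,56 | 6 |
| 10,06 | 6 |
| 7,76 | 6 |
| 3,53 | 6 |
| 9,67 | 6 |
| 2,34 | 6 |
| 6,80 | 6 |
| 5,22 | 6 |
| 2,45 | 6 |
| 4,67 | 6 |
| 13,60 | 6 |
| 5,28 | 6 |
| 3,75 | 6 |
| 15,51 | 6 |
| 8.62 | 3 |
| Not applicable | 2 |
| 7,29 | 2 |
| 17,24 | 2 |
| 17,24 mg/L | 1 |
| not provided | 1 |
| the sediment is anoxic but the water colum contains O2 (6.78-7.66 mg/L) | 1 |
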
turbomam commented 1 year ago

If we use oxy_stat_samp, maybe the following really would be adequate

The other oxy_stat_samp values boil down to either some variant of NA or a concentration of oxygen, presumably in mg/L
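
As a rough sketch of that split, the function below buckets a raw oxy_stat_samp value into an enumeration value, an NA variant, or a bare oxygen concentration. The names `classify_oxy_stat_samp` and `NA_VARIANTS`, and the regular expression, are assumptions made for illustration. The NA list only covers spellings seen in the query results, and the pattern accepts either a decimal comma or a decimal point with an optional mg/L suffix.

```python
import re

# Existing permissible values of oxy_stat_samp in MIxS.
ENUM_VALUES = {"aerobic", "anaerobic", "other"}

# Hypothetical list of NA spellings, compared after lowercasing.
NA_VARIANTS = {
    "na", "n/a", "none", "missing", "unknown", "not collected",
    "not applicable", "not available", "not provided",
}

# A bare number such as "7,37" or "8.62", optionally followed by "mg/l".
CONCENTRATION = re.compile(r"^\d+(?:[.,]\d+)?(?:\s*mg/l)?$")


def classify_oxy_stat_samp(value):
    """Return 'enum', 'na', 'concentration' or 'unclassified' for a raw value."""
    v = value.strip().lower()
    if v in ENUM_VALUES:
        return "enum"
    if v in NA_VARIANTS:
        return "na"
    if CONCENTRATION.match(v):
        return "concentration"
    return "unclassified"


for raw in ["aerobic", "NOT APPLICABLE", "17,24 mg/L", "anaerobe"]:
    print(raw, "->", classify_oxy_stat_samp(raw))
```

Values such as `anaerobe` or the truncated `not collecte` fall into 'unclassified' here, so they would still need a manual decision.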

ssarrafan commented 1 year ago

Adding to current sprint per Mark. Need feedback from @mslarae13

mslarae13 commented 1 year ago

I would be more inclined to use oxy_stat_samp since it is supposed to be about samples and rel_to_oxygen is supposed to be about organisms (for checklists like MIGS-ba?)

  • I'm good with that!

After deciding that, we should combine the values from the rel_to_oxygen enumeration, the oxy_stat_samp enumeration and the values found in NCBI's biosample_set into one reasonable NMDC enumeration.

  • Also agree with providing the full enumeration list. @turbomam
turbomam commented 1 year ago

@mslarae13 I'm starting this now. I will provide the list of enumerated values soon.

turbomam commented 1 year ago

src/schema/mixs.yaml already has this

rel_to_oxygen_enum:
  from_schema: http://w3id.org/mixs/terms
  permissible_values:
    aerobe: {}
    anaerobe: {}
    facultative: {}
    microaerophilic: {}
    microanaerobe: {}
    obligate aerobe: {}
    obligate anaerobe: {}

and

oxy_stat_samp_enum:
  from_schema: http://w3id.org/mixs/terms
  permissible_values:
    aerobic: {}
    anaerobic: {}
    other: {}
turbomam commented 1 year ago

Let's leave the range of oxy_stat_samp as the existing oxy_stat_samp_enum. I don't think it makes sense to describe a sample as any of these

I guess if we found some decisive cutoffs between different oxygenation states, we could update oxy_stat_samp_enum.

ssarrafan commented 1 year ago

Based on recent update will move to new sprint to be closed

mslarae13 commented 1 year ago

@turbomam I'm good with that. Will we leave the 'other' option?

turbomam commented 1 year ago

Yes, I included 'other'. This should be in nmdc-schema 7.6.0 and submission-schema 7.6.0 now. I'll confirm in a few minutes.

turbomam commented 1 year ago

confirmed: submission schema 7.6.0 updated as described

mslarae13 commented 1 year ago

Thanks @turbomam

@pkalita-lbl can we get this change propagated to the submission schema?

pkalita-lbl commented 1 year ago

If I'm reading Mark's comments correctly, these changes went into submission schema v7.6.0. A later version of the submission schema (v7.6.5) is already used by the portal codebase, but it hasn't been released to production yet. So I would expect you'd be able to see this in dev right now.

ssarrafan commented 10 months ago

Schema updates have been done since, so closing this issue.