Closed turbomam closed 10 months ago
Can we run a query that asks "of the samples captured in NMDC (mongoDB), do any of the Biosample objects have this slot (or oxy_stat_samp) filled out? If so, what is there?"
Once @turbomam has made a query from NMDC mongoDB, reassign to Montana to check
Only keep rel_to_oxygen. Note in rel_to_oxygen that this is applicable to "Column: oxygenation status of sample".
db.getCollection("biosample_set").find( { part_of : { $exists : true } } );
2449
db.getCollection("biosample_set").find( { rel_to_oxygen : { $exists : true } } );
0
db.getCollection("biosample_set").find( { oxy_stat_samp : { $exists : true } } );
0
Neither rel_to_oxygen nor oxy_stat_samp has been provided for any biosample in the production MongoDB as of this date.
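The same "is this slot filled out, and with what?" check can be sketched in Python. This is a minimal illustration over in-memory dictionaries, not the query that was actually run; the function name slot_usage and the sample documents are hypothetical. Against a live database, pymongo's count_documents with the same $exists filter would return the counts directly.

```python
from collections import Counter


def slot_usage(biosamples, slot):
    """Tally the values of `slot` across biosample documents.

    Documents that lack the slot are skipped, which mirrors the
    Mongo filter {slot: {$exists: true}}.
    """
    return Counter(str(doc[slot]) for doc in biosamples if slot in doc)


# Hypothetical stand-in documents; real ones would come from biosample_set.
sample_docs = [
    {"id": "example:1", "part_of": ["example:study-1"]},
    {"id": "example:2", "part_of": ["example:study-1"], "oxy_stat_samp": "aerobic"},
]

for slot in ("rel_to_oxygen", "oxy_stat_samp"):
    usage = slot_usage(sample_docs, slot)
    # Print the slot name, how many documents carry it, and the values seen.
    print(slot, sum(usage.values()), dict(usage))
```

Returning the tally rather than a bare count answers both halves of the question at once: a total of zero means the slot is unused, and any non-zero tally shows what was entered.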
Structured comment name | Item (rdfs:label) | Definition | Expected value | Value syntax | Example | Section | migs_eu | migs_ba | migs_pl | migs_vi | migs_org | mims | mimarks_s | mimarks_c | misag | mimag | miuvig | Preferred unit | Occurrence | MIXS ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rel_to_oxygen | relationship to oxygen | Is this organism an aerobe, anaerobe? Please note that aerobic and anaerobic are valid descriptors for microbial environments | enumeration | [aerobe|anaerobe|facultative|microaerophilic|microanaerobe|obligate aerobe|obligate anaerobe] | aerobe | nucleic acid sequence source | - | C | - | - | - | X | X | C | X | X | - | 1 | MIXS:0000015 |
Environmental package | Structured comment name | Package item | Definition | Expected value | Value syntax | Example | Requirement | Preferred unit | Occurrence | MIXS ID |
---|---|---|---|---|---|---|---|---|---|---|
agriculture | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic] | C | 1 | MIXS:0000753 | ||
air | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
host-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
human-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
human-gut | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
human-oral | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
human-skin | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
human-vaginal | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
hydrocarbon resources-cores | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
hydrocarbon resources-fluids/swabs | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
microbial mat/biofilm | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
miscellaneous natural or artificial environment | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
plant-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
sediment | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
symbiont-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample. | enumeration | [aerobic|anaerobic] | aerobic | X | 1 | MIXS:0000753 | |
wastewater/sludge | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 | |
water | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic|anaerobic|other] | aerobic | X | 1 | MIXS:0000753 |
@mslarae13 I agree that we should only use one of rel_to_oxygen or oxy_stat_samp for NMDC biosamples. I would be more inclined to use oxy_stat_samp since it is supposed to be about samples and rel_to_oxygen is supposed to be about organisms (for checklists like MIGS-ba?)
After deciding that, we should combine the values from the rel_to_oxygen enumeration, the oxy_stat_samp enumeration and the values found in NCBI's biosample_set into one reasonable NMDC enumeration.
Here are the oxy_stat_samp values in BBOP's relational version of NCBI's biosample_set:
select
value,
count(1)
from
all_attribs aa
where
aa.harmonized_name = 'oxy_stat_samp'
group by
value
order by
count(1) desc ;
value | count |
---|---|
aerobic | 4420 |
NA | 4332 |
anaerobic | 3890 |
not collected | 2730 |
not applicable | 2165 |
missing | 1757 |
anaerobe | 535 |
NOT APPLICABLE | 418 |
0,00 | 144 |
none | 123 |
N/A | 39 |
aerobe | 36 |
Unknown | 26 |
not collecte | 24 |
Not collected | 17 |
Not available | 16 |
5,03 | 6 |
7,37 | 6 |
14,75 | 6 |
10,45 | 6 |
4,89 | 6 |
7,07 | 6 |
unknown | 6 |
7,50 | 6 |
4,84 | 6 |
10,56 | 6 |
10,06 | 6 |
7,76 | 6 |
3,53 | 6 |
9,67 | 6 |
2,34 | 6 |
6,80 | 6 |
5,22 | 6 |
2,45 | 6 |
4,67 | 6 |
13,60 | 6 |
5,28 | 6 |
3,75 | 6 |
15,51 | 6 |
8.62 | 3 |
Not applicable | 2 |
7,29 | 2 |
17,24 | 2 |
17,24 mg/L | 1 |
not provided | 1 |
the sediment is anoxic but the water colum contains O2 (6.78-7.66 mg/L) | 1 |
If we use oxy_stat_samp, maybe the following really would be adequate
The other oxy_stat_samp values boil down to either some variant of NA or a concentration of oxygen, presumably in mg/L
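The grouping described above can be sketched as a small classifier. This is only an illustration under stated assumptions: the category names, the set of "missing" spellings, and the concentration pattern are all hypothetical choices derived from the values listed in the table, not from any NMDC or NCBI code.

```python
import re

# Assumed groupings, drawn only from the values observed in the table above.
PERMISSIBLE = {"aerobic", "anaerobic"}
MISSING = {
    "na", "n/a", "none", "missing", "unknown", "not collected",
    "not collecte", "not applicable", "not available", "not provided",
}
# A number with a decimal comma or point, optionally followed by mg/L.
CONCENTRATION = re.compile(r"^\d+[.,]\d+(\s*mg/L)?$", re.IGNORECASE)


def classify(value):
    """Assign an observed oxy_stat_samp value to one coarse category."""
    text = value.strip()
    lowered = text.lower()
    if lowered in PERMISSIBLE:
        return "permissible"
    if lowered in MISSING:
        return "missing"
    if CONCENTRATION.match(text):
        return "concentration"
    # Noun forms such as 'aerobe' and free text end up here for a human to check.
    return "needs review"


for observed in ("aerobic", "NOT APPLICABLE", "17,24 mg/L", "aerobe"):
    print(observed, "->", classify(observed))
```

Values such as aerobe and anaerobe are deliberately left in "needs review" rather than mapped automatically, since whether a noun describing an organism should be rewritten as a sample-level adjective is a curation decision.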
Adding to current sprint per Mark. Need feedback from @mslarae13
I would be more inclined to use oxy_stat_samp since it is supposed to be about samples and rel_to_oxygen is supposed to be about organisms (for checklists like MIGS-ba?)
- I'm good with that!
After deciding that, we should combine the values from the rel_to_oxygen enumeration, the oxy_stat_samp enumeration and the values found in NCBI's biosample_set into one reasonable NMDC enumeration.
- Also agree with providing the full enumeration list. @turbomam
@mslarae13 I'm starting this now. I will provide the list of enumerated values soon.
src/schema/mixs.yaml already has this
rel_to_oxygen_enum:
  from_schema: http://w3id.org/mixs/terms
  permissible_values:
    aerobe: {}
    anaerobe: {}
    facultative: {}
    microaerophilic: {}
    microanaerobe: {}
    obligate aerobe: {}
    obligate anaerobe: {}
and
oxy_stat_samp_enum:
  from_schema: http://w3id.org/mixs/terms
  permissible_values:
    aerobic: {}
    anaerobic: {}
    other: {}
Let's leave the range of oxy_stat_samp as the existing oxy_stat_samp_enum. I don't think it makes sense to describe a sample as any of these
I guess if we found some decisive cutoffs between different oxygenation states, we could update oxy_stat_samp_enum.
Based on recent update will move to new sprint to be closed
@turbomam I'm good with that. Will we leave the 'other' option?
Yes, I included 'other'. This should be in nmdc-schema 7.6.0 and submission-schema 7.6.0 now. I'll confirm in a few minutes.
confirmed: submission schema 7.6.0 updated as described
Thanks @turbomam
@pkalita-lbl can we get this change propagated to the submission schema?
If I'm reading Mark's comments correctly these changes went into submission schema v7.6.0. A later version of the submission schema (v7.6.5) is already used by the portal codebase but it hasn't been released to production yet. So I would expect you'd be able to see this in dev right now.
Schema updates have been done since so closing this issue.
This illustrates approaches for repairing columns with enumerations of permissible values, also known as controlled vocabularies. Look for 'enumeration' in the MIxS 'Expected value' column, or a range of '*_enum' in the LinkML model. See reference material below.
Related code: sample_annotator/rel_to_oxygen_example.py
Permissible values
Reference material
rel_to_oxygen
rel_to_oxygen
Observed, with matches
Easy fixes:
Trickier!
Probably not justified when the count is really low, like 1
Gotchas:
- aerobe is a noun: a microorganism that requires the presence of oxygen
- aerobic is an adjective which can be applied to an organism
- oxic is an adjective that describes the water an organism lives in, not the organism itself
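One way to picture the easy fixes, the trickier ones, and the gotchas together is the sketch below. It is not the contents of sample_annotator/rel_to_oxygen_example.py; the function name repair, the 0.9 cutoff, and the use of the standard library's difflib are assumptions made for illustration. The permissible values are the rel_to_oxygen_enum entries.

```python
from difflib import get_close_matches

# Permissible values of rel_to_oxygen_enum.
REL_TO_OXYGEN = [
    "aerobe", "anaerobe", "facultative", "microaerophilic",
    "microanaerobe", "obligate aerobe", "obligate anaerobe",
]


def repair(value, permissible=REL_TO_OXYGEN, cutoff=0.9):
    """Return a permissible value for `value`, or None if no safe match exists."""
    # Easy fix: normalise case and collapse stray whitespace.
    cleaned = " ".join(value.lower().split())
    if cleaned in permissible:
        return cleaned
    # Trickier: accept a near-identical spelling, using a strict cutoff
    # (an assumed threshold) so that related but distinct words stay unmatched.
    matches = get_close_matches(cleaned, permissible, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for observed in ("  Obligate   Aerobe ", "microaerophillic", "aerobic", "oxic"):
    print(repr(observed), "->", repair(observed))
```

The strict cutoff is what keeps the gotchas out of the automatic path: aerobic and oxic come back as None, so deciding whether an adjective should be recorded as the noun aerobe stays with a curator.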