microbiomedata / nmdc-ontology

Creative Commons Zero v1.0 Universal

Set analysis of ENVO terms used in INSDC Biosample Metadata vs curated subsets spreadsheet #3

Open turbomam opened 3 years ago

turbomam commented 3 years ago

AKA

https://www.ncbi.nlm.nih.gov/biosample/

vs

EnvO_triad_terms_MIxS_soil_package_review_05182021

@cmungall @elishawc @wdduncan There was some unfinished discussion about which repo would be the best home for this issue.

turbomam commented 3 years ago

@wdduncan @cmungall @elishawc @jagadishcs @mslarae13

We could meet to discuss these findings at the Wednesday July 14th NMDC Sync meeting, or sooner if you prefer.


Here's my analysis of the EnvO terms recommended per package and MIxS slot, vs the count of EnvO terms found in the INSDC BioSample metadata, after being repaired/mapped.

Out of 921 rows in my table, about 200 are EnvO terms that were recommended for a package/slot combo but never used in that way in the INSDC BioSample metadata, for example:

| package | slot | class | label | count | reccomended |
| --- | --- | --- | --- | --- | --- |
| water | env_local_scale | ENVO:00000061 | underground water body | 0.0 | True |

About 550 are EnvO terms that appeared in the INSDC BioSample metadata with a package/slot combo at least twice but were not explicitly recommended in the recent review, for example:

| package | slot | class | label | count | reccomended |
| --- | --- | --- | --- | --- | --- |
| soil | env_medium | ENVO:00001998 | soil | 13249.0 | False |

In summary, of 921 combinations of packages, slots and EnvO terms, only about 80 were both recommended by the review team and observed at least two times in the INSDC dataset.

200 + 550 + 80 != 921 because the three counts are rounded and because I excluded combinations that were observed only once in the INSDC dataset.
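
The three groups above can be sketched with plain Python sets. This is only an illustration of the set logic, not the code from the notebook: the function name `partition_combos`, the `min_count` parameter, and the `ENVO:HYPOTHETICAL_*` identifiers are assumptions made up for the example. Only the water and soil rows come from the tables above.

```python
from typing import Dict, Set, Tuple

# (package, slot, EnvO class id)
Combo = Tuple[str, str, str]


def partition_combos(
    recommended: Set[Combo],
    observed_counts: Dict[Combo, int],
    min_count: int = 2,
) -> Dict[str, Set[Combo]]:
    """Split package/slot/term combos into the three groups discussed above."""
    # Combos seen at least once in the BioSample metadata.
    used = {c for c, n in observed_counts.items() if n > 0}
    # Combos seen often enough to count as "observed" (at least twice by default).
    frequent = {c for c, n in observed_counts.items() if n >= min_count}
    return {
        "recommended_never_used": recommended - used,
        "observed_not_recommended": frequent - recommended,
        "recommended_and_observed": recommended & frequent,
    }


# Two rows taken from the thread, plus two hypothetical placeholder rows.
water = ("water", "env_local_scale", "ENVO:00000061")
soil = ("soil", "env_medium", "ENVO:00001998")
both = ("soil", "env_broad_scale", "ENVO:HYPOTHETICAL_1")  # hypothetical
once = ("soil", "env_local_scale", "ENVO:HYPOTHETICAL_2")  # hypothetical

recommended = {water, both}
observed_counts = {water: 0, soil: 13249, both: 5, once: 1}

groups = partition_combos(recommended, observed_counts)
for name, combos in groups.items():
    print(name, sorted(combos))
```

The combo observed only once lands in none of the three groups, which is one reason the group sizes do not add up to the total number of rows.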

The analysis was implemented in this notebook

turbomam commented 3 years ago

Looking for confirmation, especially regarding the repair process?

Run something like this against the July build of the harmonized data biosample SQLite database:

SELECT
    scoping_col,
    scoping_value,
    biosample_col_to_map,
    raw,
    consensus_id,
    consensus_lab,
    count(1) AS sample_count
FROM
    repaired_long rl1
WHERE
    scoping_col = 'env_package_normalization.EnvPackage'
    AND scoping_value = 'soil'
    AND biosample_col_to_map = 'env_medium'
    AND consensus_id = 'ENVO:00001998'
GROUP BY
    scoping_col,
    scoping_value,
    biosample_col_to_map,
    raw,
    consensus_id,
    consensus_lab
ORDER BY
    count(1) DESC;

| scoping_col | scoping_value | biosample_col_to_map | raw | consensus_id | consensus_lab | sample_count |
| --- | --- | --- | --- | --- | --- | --- |
| env_package_normalization.EnvPackage | soil | env_medium | soil | ENVO:00001998 | soil | 12596 |
| env_package_normalization.EnvPackage | soil | env_medium | Soil | ENVO:00001998 | soil | 439 |
| env_package_normalization.EnvPackage | soil | env_medium | ENVO:soil | ENVO:00001998 | soil | 214 |

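
A minimal sketch of running the same kind of check from Python with the standard `sqlite3` module. The in-memory table is a tiny stand-in whose column types are assumed; against the real database you would connect to the SQLite file instead. The helper name `raw_variant_counts` is hypothetical. The last lines check that the three raw spellings reported above add up to the 13249 shown for soil / env_medium / ENVO:00001998 in the earlier table.

```python
import sqlite3

# Parameterized version of the query above, reduced to the raw spelling and its count.
QUERY = """
SELECT raw, count(1) AS sample_count
FROM repaired_long
WHERE scoping_col = 'env_package_normalization.EnvPackage'
  AND scoping_value = ?
  AND biosample_col_to_map = ?
  AND consensus_id = ?
GROUP BY raw
ORDER BY count(1) DESC, raw
"""


def raw_variant_counts(conn, package, slot, term_id):
    """Return (raw value, sample count) pairs that were mapped to one EnvO term."""
    return conn.execute(QUERY, (package, slot, term_id)).fetchall()


# In-memory stand-in for the harmonized database (assumed schema, toy rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repaired_long (scoping_col TEXT, scoping_value TEXT, "
    "biosample_col_to_map TEXT, raw TEXT, consensus_id TEXT, consensus_lab TEXT)"
)
scope = "env_package_normalization.EnvPackage"
toy_rows = (
    [(scope, "soil", "env_medium", "soil", "ENVO:00001998", "soil")] * 3
    + [(scope, "soil", "env_medium", "Soil", "ENVO:00001998", "soil")] * 2
    + [(scope, "soil", "env_medium", "ENVO:soil", "ENVO:00001998", "soil")]
)
conn.executemany("INSERT INTO repaired_long VALUES (?, ?, ?, ?, ?, ?)", toy_rows)

variants = raw_variant_counts(conn, "soil", "env_medium", "ENVO:00001998")
print(variants)

# Counts reported in the thread for the three raw spellings.
reported = {"soil": 12596, "Soil": 439, "ENVO:soil": 214}
reported_total = sum(reported.values())
print(reported_total)  # 13249, the count shown for this combo in the earlier table
```
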
turbomam commented 3 years ago

PS I only used the Subset_EnvO_Broad_Local_Medium_terms_062221 tab from EnvO_triad_terms_MIxS_soil_package_review_05182021