Usage of slots and columns across authorities

turbomam commented 2 years ago

Gather input, code, and output for:

[x] MIxS soil vs Montana's Example Soil Google Sheet
[x] Per-package slot usage within INSDC Biosamples vs MIxS guidance
[ ] review of @mslarae13's Index of Terms for a potential EMSL DataHarmonizer template

turbomam commented 2 years ago

MIxS soil vs Montana's Example Soil Google Sheet

I did an accounting of the differences between Montana's Example Soil Google Sheet and my (MIxS-based) DataHarmonizer Soil template. It started out as a Python script but became a largely manual process with two parts:

Determining which columns are "associated" with soil samples in Montana's sheet. I don't have a definition for "associated"; it could mean required, recommended, or optional. For example, I would say that agrochem_addition is associated with soil samples in cells A292:B292 of the MenuTerms tab. I made the associations by manually inspecting the MenuTerms tab. I have some preliminary code that does this by searching the sheet for keywords and clustering the hit locations to find dense mentions of agrochem_addition or soil. I think the associations could also be determined by analyzing the formulae in tabs like Metadata and EnvironmentalMetadata, but I don't know how to do that.

Looking for patterns that explain the mismatches between Montana's soil-associated columns and the MIxS soil slots. Since the work started out programmatically, the output is JSON. I added structure and notes to explain the mismatches.
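The keyword-search-and-cluster idea from the first part could look roughly like the sketch below. This is not the preliminary code from the notebook; the function names, the `max_gap` parameter, and the toy grid standing in for a tab like MenuTerms are all assumptions made for illustration.

```python
def find_keyword_cells(grid, keywords):
    """Return (row, col, keyword) for every cell whose text contains a keyword."""
    hits = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            text = str(cell).lower()
            for kw in keywords:
                if kw.lower() in text:
                    hits.append((r, c, kw))
    return hits


def cluster_rows(hits, max_gap=1):
    """Group hits whose row index is within max_gap rows of the previous hit."""
    clusters = []
    for hit in sorted(hits):
        if clusters and hit[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters


# Toy stand-in for a sheet tab (hypothetical contents, 0-based indices).
grid = [
    ["soil", "agrochem_addition"],
    ["soil", "crop_rotation"],
    ["", ""],
    ["", ""],
    ["water", "alkalinity"],
    ["soil", "agrochem_addition"],
]

hits = find_keyword_cells(grid, ["soil", "agrochem_addition"])
clusters = cluster_rows(hits)
for cluster in clusters:
    rows = sorted({r for r, _, _ in cluster})
    print(f"rows {rows[0]}-{rows[-1]}: {len(cluster)} keyword hits")
```

Dense clusters that mention both a package keyword and a column name would then be candidates for manual review, not automatic associations.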

code: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis.ipynb

computed output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis.json

curated output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis_curated.json
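As a rough illustration of how the computed (pre-curation) JSON could be produced, here is a minimal sketch that compares two name lists as sets. The key names in the report and the example column names are assumptions; `sample_collection_site` in particular is a hypothetical sheet-only column, and the actual structure of the linked JSON files may differ.

```python
import json


def compare_names(sheet_columns, mixs_slots):
    """Split names into shared, sheet-only, and MIxS-only groups."""
    sheet, mixs = set(sheet_columns), set(mixs_slots)
    return {
        "shared": sorted(sheet & mixs),
        "sheet_only": sorted(sheet - mixs),
        "mixs_only": sorted(mixs - sheet),
    }


# Illustrative names only; not taken from the real sheet or template.
sheet_columns = ["agrochem_addition", "depth", "sample_collection_site"]
mixs_slots = ["agrochem_addition", "depth", "cur_land_use"]

report = compare_names(sheet_columns, mixs_slots)
print(json.dumps(report, indent=2))
```

The curated version would then add notes to the `sheet_only` and `mixs_only` groups by hand.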

turbomam commented 2 years ago

Per-package slot usage within INSDC Biosamples vs MIxS guidance

code: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage.ipynb

Three sub-steps are included in the notebook:

Determine which annotations are applied to the INSDC Biosamples on a package-by-package basis. These are called "columns" in the code, but that isn't quite right: technically, they start out as attributes in biosample_set.xml.gz. Only attributes that have a harmonized name are included. The result is then merged with a table of which MIxS slots (which largely overlap with the attributes) are associated with each package. With that, you can query for the slots that MIxS associates with a package but that are least frequently used in INSDC Biosamples, or for the attributes that are most frequently used for samples from some package in INSDC even though MIxS doesn't associate that slot with that package.
- Output
  - https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage.tsv
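The merge and the two queries described in this sub-step can be sketched in plain Python as follows. This is an assumption-laden illustration, not the notebook's code: the function name, the row keys, the 0.5 threshold, and the toy samples are hypothetical, and real input would come from the harmonized attributes in biosample_set.xml.gz.

```python
from collections import Counter, defaultdict


def usage_vs_guidance(samples, mixs_slots_by_package):
    """Combine observed attribute usage with MIxS package/slot associations.

    samples: iterable of (package, iterable of harmonized attribute names)
    mixs_slots_by_package: dict mapping package -> iterable of slot names
    """
    sample_counts = Counter()
    attr_counts = defaultdict(Counter)
    for package, attrs in samples:
        sample_counts[package] += 1
        attr_counts[package].update(set(attrs))  # count each name once per sample

    rows = []
    for package, total in sample_counts.items():
        expected = set(mixs_slots_by_package.get(package, ()))
        for name in sorted(expected | set(attr_counts[package])):
            rows.append({
                "package": package,
                "name": name,
                "fraction_used": attr_counts[package][name] / total,
                "in_mixs_package": name in expected,
            })
    return rows


# Toy data (hypothetical samples and associations).
samples = [
    ("soil", {"depth", "ph", "host"}),
    ("soil", {"depth", "host"}),
    ("water", {"depth"}),
]
mixs_slots_by_package = {
    "soil": ["depth", "ph", "cur_land_use"],
    "water": ["depth", "alkalinity"],
}

rows = usage_vs_guidance(samples, mixs_slots_by_package)

# Slots MIxS associates with a package that INSDC samples rarely use.
underused = [r for r in rows if r["in_mixs_package"] and r["fraction_used"] < 0.5]
# Attributes INSDC samples use although MIxS does not associate them with the package.
unexpected = [r for r in rows if not r["in_mixs_package"] and r["fraction_used"] > 0]
```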
PCA plot of the per-package Biosample attribute usage. Intended as a conversation starter: there are 20M biosamples, but only 251k are annotated with an env_package. Could we use the attribute usage alone to predict the packages for all of the other samples?
- Output (raw and curated)
  - https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage.pdf
  - https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage_untangled.pdf
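To make the prediction question concrete, here is one very simple way it could be tried. This is not PCA and not something the notebook does: it builds a per-package attribute profile from the labeled samples and assigns an unlabeled sample to the package whose profile is most similar by Jaccard similarity. All names, the `min_fraction` threshold, and the toy data are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of attribute names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_profiles(labeled_samples, min_fraction=0.5):
    """Per package, keep attributes used by at least min_fraction of its samples."""
    counts, totals = {}, {}
    for package, attrs in labeled_samples:
        totals[package] = totals.get(package, 0) + 1
        bucket = counts.setdefault(package, {})
        for name in set(attrs):
            bucket[name] = bucket.get(name, 0) + 1
    return {
        package: {n for n, k in bucket.items() if k / totals[package] >= min_fraction}
        for package, bucket in counts.items()
    }


def predict_package(attrs, profiles):
    """Return the package whose profile best matches the sample's attributes."""
    return max(profiles, key=lambda p: jaccard(attrs, profiles[p]))


# Toy labeled samples (hypothetical).
labeled = [
    ("soil", {"depth", "ph", "cur_land_use"}),
    ("soil", {"depth", "ph"}),
    ("water", {"depth", "alkalinity", "salinity"}),
    ("water", {"depth", "salinity"}),
]
profiles = build_profiles(labeled)
print(predict_package({"ph", "depth"}, profiles))
print(predict_package({"salinity"}, profiles))
```

Whether anything like this works at scale would depend on how distinctive the per-package attribute usage really is, which is what the PCA plot is meant to help judge.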
What are the common values for all of those attributes? Determined with Pandas Profiling.
- Output below. You'll have to download the HTML report or clone the repo rather than viewing it in the GitHub web interface.
  - https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage_profile.html
  - https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage_profile.json
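For readers who only want the "common values" part without opening the full report, the same question can be approximated with the standard library. This is a stand-in sketch, not Pandas Profiling itself; the function name, `top_n`, and the toy records are assumptions.

```python
from collections import Counter, defaultdict


def common_values(records, top_n=3):
    """Return the top_n most frequent non-empty values for each attribute."""
    values = defaultdict(Counter)
    for record in records:
        for name, value in record.items():
            if value is not None and value != "":
                values[name][value] += 1
    return {name: counter.most_common(top_n) for name, counter in values.items()}


# Toy records (hypothetical attribute values).
records = [
    {"env_package": "soil", "depth": "10 cm"},
    {"env_package": "soil", "depth": "0-5 cm"},
    {"env_package": "water", "depth": "10 cm"},
    {"env_package": "", "depth": "10 cm"},
]

summary = common_values(records)
for name, top in summary.items():
    print(name, top)
```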

