microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

Usage of slots and columns across authorities #398

Open turbomam opened 2 years ago

turbomam commented 2 years ago

Gather input, code and output for

turbomam commented 2 years ago

MIxS soil vs Montana's Example Soil Google Sheet

I did an accounting of the differences between Montana's Example Soil Google Sheet and my (MIxS-based) DataHarmonizer Soil template. It started out as a Python script but ended up being mostly manual curation.

code: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis.ipynb

computed output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis.json

curated output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis_curated.json
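For anyone who wants the gist without opening the notebook, the scripted part of a comparison like this can be sketched roughly as below. This is not the notebook's code: the function name `compare_columns`, the normalization rule, and the example column and slot names are all assumptions made up for illustration.

```python
def compare_columns(sheet_columns, template_slots):
    """Return names unique to each source and names shared by both."""

    def normalize(name):
        # Assumed normalization: trim, lowercase, spaces to underscores.
        return name.strip().lower().replace(" ", "_")

    sheet = {normalize(c) for c in sheet_columns}
    template = {normalize(s) for s in template_slots}
    return {
        "sheet_only": sorted(sheet - template),
        "template_only": sorted(template - sheet),
        "shared": sorted(sheet & template),
    }


# Hypothetical inputs, not the real sheet headers or template slots.
example_sheet = ["Sample Name", "depth", "pH", "collection_date"]
example_template = ["samp_name", "depth", "ph", "collection_date", "env_broad_scale"]

result = compare_columns(example_sheet, example_template)
for key, names in result.items():
    print(key, names)
```

In this toy input, `sample_name` and `samp_name` land on opposite sides of the diff even though they probably mean the same thing. Near-misses like that can't be settled by exact matching, which is the kind of case that needs a hand-curated pass on top of the computed output.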

turbomam commented 2 years ago

Per-package slot usage within INSDC Biosamples vs MIxS guidance

code: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage.ipynb

Three sub-steps are included in the notebook:

  1. Determine which annotations are applied to the INSDC Biosamples on a package-by-package basis. These are called "columns" in the code, but that isn't quite right: technically, they start out as attributes in biosample_set.xml.gz, and only those attributes that have a harmonized name are included. The result is then merged with a table of which MIxS slots (which largely overlap with the attributes) are associated with each package. With that, you can query for the slots that MIxS associates with a package but that are least frequently used in INSDC Biosamples, or for the attributes that are most frequently used for samples from some package in INSDC even though MIxS doesn't associate that slot with that package.
  2. PCA plot of the per-package Biosample attribute usage. Intended as a conversation starter: there are 20M biosamples but only 251k are annotated with an env_package. Could we use the attribute usage alone to predict the packages for all of those other samples?
  3. What are the common values for all of those attributes? Determined with Pandas Profiling.
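The merge-and-query idea in sub-step 1 can be sketched in plain Python as below. This is only an illustration under stated assumptions, not what the notebook does: the record layout (`env_package`, `attributes`), the function names, and the toy data are all hypothetical, and the real input would be parsed from biosample_set.xml.gz.

```python
from collections import Counter, defaultdict


def usage_by_package(biosamples):
    """Fraction of samples in each package that carry each attribute."""
    totals = Counter()
    counts = defaultdict(Counter)
    for sample in biosamples:
        package = sample["env_package"]
        totals[package] += 1
        # Count each attribute at most once per sample.
        for attr in set(sample["attributes"]):
            counts[package][attr] += 1
    return {
        pkg: {attr: n / totals[pkg] for attr, n in attrs.items()}
        for pkg, attrs in counts.items()
    }


def compare_with_mixs(usage, mixs_slots):
    """Outer-join observed usage with the MIxS package-to-slot table."""
    rows = []
    for package in sorted(set(usage) | set(mixs_slots)):
        used = usage.get(package, {})
        expected = mixs_slots.get(package, set())
        for name in sorted(set(used) | expected):
            rows.append({
                "package": package,
                "name": name,
                "fraction": used.get(name, 0.0),
                "in_mixs": name in expected,
            })
    return rows


# Hypothetical toy data standing in for the parsed Biosample attributes.
samples = [
    {"env_package": "soil", "attributes": ["depth", "ph", "host"]},
    {"env_package": "soil", "attributes": ["depth", "host"]},
    {"env_package": "water", "attributes": ["depth"]},
]
mixs = {"soil": {"depth", "ph", "cur_land_use"}, "water": {"depth", "salinity"}}

rows = compare_with_mixs(usage_by_package(samples), mixs)

# MIxS slots that are least used in practice, lowest fraction first.
neglected = sorted((r for r in rows if r["in_mixs"]), key=lambda r: r["fraction"])
# Attributes used in a package although MIxS does not list them, highest first.
unexpected = sorted(
    (r for r in rows if not r["in_mixs"]), key=lambda r: r["fraction"], reverse=True
)
```

Keeping rows from both sides of the join (slots with zero usage and attributes with no MIxS association) is what lets both queries run against the same table.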