Open turbomam opened 2 years ago
I did an accounting of the differences between Montana's Example Soil Google Sheet and my (MIxS-based) DataHarmonizer Soil template. It started out as a Python script but became pretty manual.
computed output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis.json
curated output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis_curated.json
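For anyone wanting to redo the automated part of that comparison, here is a minimal sketch of a column-versus-slot diff. It is not the script from the notebook: the function name `compare_names`, the normalization rule, and the example names are all assumptions made for illustration.

```python
def compare_names(sheet_columns, template_slots):
    """Report names unique to each source and names shared by both."""
    def normalize(name):
        # Assumed normalization: trim, lowercase, spaces to underscores.
        return name.strip().lower().replace(" ", "_")

    sheet = {normalize(c) for c in sheet_columns}
    template = {normalize(s) for s in template_slots}
    return {
        "sheet_only": sorted(sheet - template),
        "template_only": sorted(template - sheet),
        "shared": sorted(sheet & template),
    }


# Hypothetical column and slot names, for illustration only.
report = compare_names(
    ["Depth", "Soil pH", "collection date"],
    ["depth", "collection_date", "elev"],
)
print(report)
```

Names that differ by more than case and spacing (synonyms, abbreviations) would still need manual curation, which is probably why the process ended up being manual.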
Three sub-steps are included in the notebook
biosample_set.xml.gz. Only those attributes that have a harmonized name are included. This is then merged with a table of which MIxS slots (which largely overlap with the attributes) are associated with each package. With that, you can query for the slots that MIxS associates with a package but that are least frequently used in INSDC Biosamples, or for the attributes that are most frequently used for samples from some package in INSDC even though MIxS doesn't associate that slot with that package.
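The two queries described above can be sketched roughly as follows. This is a toy, in-memory version using only the standard library; the sample data, the package-to-slot table, and the function names are hypothetical and do not reflect the real MIxS associations or the notebook's actual code.

```python
from collections import Counter

# Hypothetical toy data: (package, harmonized attribute names on one sample).
samples = [
    ("soil", {"depth", "elev", "ph"}),
    ("soil", {"depth", "ph"}),
    ("water", {"depth", "salinity"}),
]

# Hypothetical package -> slots table (not the real MIxS associations).
package_slots = {
    "soil": {"depth", "elev", "cur_land_use"},
    "water": {"depth", "salinity"},
}


def usage_by_package(samples):
    """Count how many samples in each package use each attribute."""
    counts = {}
    for package, attributes in samples:
        counts.setdefault(package, Counter()).update(attributes)
    return counts


def least_used_associated(counts, package_slots, package):
    """Slots associated with the package, least used first."""
    used = counts.get(package, Counter())
    return sorted((used[slot], slot) for slot in package_slots[package])


def most_used_unassociated(counts, package_slots, package):
    """Attributes used in the package but not associated with it, most used first."""
    used = counts.get(package, Counter())
    extras = [(n, attr) for attr, n in used.items()
              if attr not in package_slots[package]]
    return sorted(extras, key=lambda pair: (-pair[0], pair[1]))


counts = usage_by_package(samples)
print(least_used_associated(counts, package_slots, "soil"))
print(most_used_unassociated(counts, package_slots, "soil"))
```

At the scale of the full INSDC dump, the same logic would more likely be a dataframe merge or a SQL join, but the shape of the two queries stays the same.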
env_package. Could we use the attribute usage alone to predict the packages for all of those other samples?
Gather input, code and output for