turbomam / biosample-basex

Using the Base-X XML database to discover structure in NCBI's Biosample database
1 stars 0 forks source link

subset whole database on a few repaired env packages for Data Good #37

Closed turbomam closed 8 months ago

turbomam commented 2 years ago

Subsetting rows based on environmental package

create table if not exists non_attribute_metadata_sel_envs
as
select
    nam.*
from
    non_attribute_metadata nam
join harmonized_wide_repaired hwr 
    on
    nam.raw_id = hwr.raw_id
join harmonized_wide hw
    on
    nam.raw_id = hw.raw_id
where
    hwr.env_package in ('plant-associated', 'soil', 'sediment', 'water')
;
turbomam commented 2 years ago

After propagating row wise constraints (see below) recreate indices

turbomam commented 2 years ago

How should the environmental package constraints be propagated to other tables?

Sample propagation statement

create table if not exists harmonized_wide_sel_envs
as
select
    hw.*
from
    harmonized_wide hw
join non_attribute_metadata_sel_envs namse on
    namse.raw_id = hw.raw_id;
turbomam commented 2 years ago

Limit to certain columns, too?

Exhaustive list of columns from harmonized_wide (There are minable columns in table XXX, too)

PRAGMA table_info('harmonized_wide') ;

Proposed harmonized_wide columns for Data Good to mine:

name expected type notes
agrochem_addition    
air_temp    
air_temp_regm    
altitude    
bacteria_carb_prod    
barometric_press    
chem_administration    
chloride    
collection_date    
collection_device    
collection_method    
cur_land_use    
cur_vegetation    
cur_vegetation_meth    
depth    
description    
diss_carb_dioxide    
diss_inorg_nitro    
diss_inorg_phosp    
diss_org_nitro    
drainage_class    
elev    
env_broad_scale    
env_local_scale    
env_medium    
env_package    
experimental_factor    
extreme_event    
extreme_salinity    
fao_class    
filter_type    
genotype    
geo_loc_exposure    
geo_loc_name    
growth_med    
heavy_metals    
heavy_metals_meth    
height_or_length    
horizon    
horizon_meth    
host_taxid    
humidity    
humidity_regm    
lat_lon    
link_addit_analys    
misc_param    
nitrate    
nitrite    
nitro    
oxy_stat_samp    
pesticide_regm    
ph    
ph_meth    
ph_regm    
phosphate    
plant_product    
previous_land_use    
previous_land_use_meth    
raw_id    
rel_to_oxygen    
salinity    
salinity_meth    
same_as    
samp_collect_device    
samp_salinity    
samp_store_loc    
samp_store_temp    
sample_name    
slope_aspect    
slope_gradient    
soil_type    
soil_type_meth    
soluble_react_phosp    
source_material_id    
source_name    
submitted_sample_id    
submitted_subject_id    
temp    
tillage    
tissue    
tot_phosp    
tot_phosphate    
water_content_soil    
water_content_soil_meth    
wind_direction    
wind_speed    
turbomam commented 2 years ago

@turbomam todo

turbomam commented 2 years ago

Job 1

Identify row values that don't comply with the expected categorical values for a column

See MIxS Google Sheets

turbomam commented 2 years ago

Invite to sample annotator repo

rules:

  1. create an issue
  2. create a test
  3. create an issue named after the issue
  4. commit code to the branch
  5. create a PR
  6. pick reviewers
turbomam commented 2 years ago

One performance metric will improved classification, which the BBOP staff will develop