ncss-tech / soilDB

soilDB: Simplified Access to National Cooperative Soil Survey Databases
http://ncss-tech.github.io/soilDB/
GNU General Public License v3.0

encoding factors in fetch* functions #241

Open dylanbeaudette opened 2 years ago

dylanbeaudette commented 2 years ago

A couple of thoughts:

Chain of functionality:

  1. read data as text
  2. uncode() all coded columns, optionally converting to factors using metadata (encodeFactors='all')
  3. selective encoding (encodeFactors='some') or none (encodeFactors='none')
  4. ???

# in all functions that get data from NASIS
x <- query()
y <- uncode(x, encodeFactors)
return(y)

# high-level functions like fetchNASIS()
x <- getXXX_from_NASIS(encodeFactors)
if (encodeFactors == 'some') {
  .setupNASIS_factors(...)
}
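
A minimal sketch of the chain above, in base R only. It is not the soilDB implementation: `decode_columns()`, `domain_metadata`, the `common` argument, and all codes and labels are made up for illustration. Only the three `encodeFactors` settings come from the list above.

```r
# Hypothetical stand-in for NASIS domain metadata: one row per coded value.
domain_metadata <- data.frame(
  column = c(rep("drainagecl", 3), rep("pmorigin", 2)),
  code   = c(1, 2, 3, 1, 2),
  label  = c("well", "moderately well", "poorly", "granite", "basalt"),
  stringsAsFactors = FALSE
)

# Hypothetical uncode-like helper (assumption, not soilDB::uncode()).
# encodeFactors = "all":  every decoded column becomes a factor
# encodeFactors = "some": only columns named in `common` become factors
# encodeFactors = "none": decoded columns stay character
decode_columns <- function(x, metadata,
                           encodeFactors = c("none", "some", "all"),
                           common = "drainagecl") {
  encodeFactors <- match.arg(encodeFactors)
  # only columns described in the metadata are touched, so IDs are left alone
  for (col in intersect(names(x), unique(metadata$column))) {
    md <- metadata[metadata$column == col, ]
    # step 1-2: replace codes with their text labels
    x[[col]] <- md$label[match(x[[col]], md$code)]
    # step 2-3: optional conversion to factor, levels in metadata order
    as_factor <- encodeFactors == "all" ||
      (encodeFactors == "some" && col %in% common)
    if (as_factor) {
      x[[col]] <- factor(x[[col]], levels = md$label)
    }
  }
  x
}

raw <- data.frame(peiid = c(10, 11), drainagecl = c(3, 1), pmorigin = c(2, 1))
res_none <- decode_columns(raw, domain_metadata, encodeFactors = "none")
res_some <- decode_columns(raw, domain_metadata, encodeFactors = "some")
res_all  <- decode_columns(raw, domain_metadata, encodeFactors = "all")
str(res_some)
```

With `encodeFactors = "some"`, `drainagecl` comes back as a factor while `pmorigin` stays character, which is the behaviour argued for below.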

Apart from the compatibility issue with a pending version of R, there is no reason why we can't all get what we want out of NASIS. The factor-conversion code can be written to look for NASIS column names and encode levels according to either the metadata or a manually specified vector. An invert argument can be added to reverse factor levels, which is sometimes handy. That said, I don't think that we should attempt to convert all character data → factors (e.g. parent material origin) by default, just those that are most commonly used as factors (texture class, hillslope position, drainage class, etc.).

The new function / functions will likely be internal to soilDB, and will "know" how to exclude IDs.

# x: data.frame
# all: encode all character data, or just those manually defined in the function
# invert: invert factor levels / ordering
# drop: drop unused levels
.setupNASIS_factors <- function(x, all = FALSE, invert = FALSE, drop = TRUE) {

  # all = TRUE
  # use NASIS metadata

  # all = FALSE
  # use column-specific rules as follows
  # ...

  # drop = TRUE
  # drop unused levels, no matter the encoding strategy above

  # modified data.frame is returned
  return(x)
}
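
One way the skeleton above could be filled in, as a rough sketch rather than a proposed implementation. The function name `setup_factors_sketch()`, the `rules` and `id_cols` arguments, and the example column names and levels are assumptions. Because no NASIS metadata is available here, the `all = TRUE` branch falls back to sorted unique values for columns without a manual rule; the real function would use the metadata instead.

```r
# Hypothetical sketch of the skeleton above (not soilDB code).
# rules:   named list, column name -> manually specified level order (assumption)
# id_cols: columns that must never be converted (assumption)
setup_factors_sketch <- function(x, rules, all = FALSE, invert = FALSE,
                                 drop = TRUE, id_cols = c("peiid", "phiid")) {
  # candidate columns: character data, excluding IDs
  is_chr <- vapply(x, is.character, logical(1))
  candidates <- setdiff(names(x)[is_chr], id_cols)

  # all = FALSE: only columns with a manual rule are converted
  if (!all) candidates <- intersect(candidates, names(rules))

  for (col in candidates) {
    # manual rule if present, otherwise a stand-in for the metadata lookup
    lv <- if (col %in% names(rules)) rules[[col]] else sort(unique(x[[col]]))
    if (invert) lv <- rev(lv)
    f <- factor(x[[col]], levels = lv)
    # drop = TRUE: remove unused levels, whatever the encoding strategy
    if (drop) f <- droplevels(f)
    x[[col]] <- f
  }
  x
}

# illustrative data and rule
rules <- list(
  hillslopeprof = c("summit", "shoulder", "backslope", "footslope", "toeslope")
)
d <- data.frame(
  peiid = c("1", "2", "3"),
  hillslopeprof = c("toeslope", "summit", "backslope"),
  pmorigin = c("granite", "basalt", "granite"),
  stringsAsFactors = FALSE
)

res <- setup_factors_sketch(d, rules)
levels(res$hillslopeprof)
# "summit" "backslope" "toeslope"

res_inv <- setup_factors_sketch(d, rules, all = TRUE, invert = TRUE, drop = FALSE)
levels(res_inv$hillslopeprof)
# "toeslope" "footslope" "backslope" "shoulder" "summit"
```

Keeping `drop` independent of `all` matches the comment in the skeleton: unused levels are removed no matter how the levels were chosen.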

Finally, I suggest that fetchNASIS() should default to:

brownag commented 2 years ago

I added two new domain attributes to the query used by uncode() (in .get_NASIS_metadata()) for use in future functions.

MetadataDomainMaster.DomainRanked

capability_class, corrosion_concrete, corrosion_uncoated_steel, flooding_duration_class, flooding_ponding_month, potential_frost_action, soil_erodibility_factor, wind_erodibility_index, drainage_class, excavation_difficulty_class, soil_slippage_potential, ponding_duration_class, pore_continuity_vertical, rupture_resist_block_cem, wildlife_rating, mapunit_hel_class, flooding_frequency_class, ponding_frequency_class, date_time_interval_qualifier, erosion_class, fl_soil_leaching_potential, fl_soil_runoff_potential, runoff, taxonomic_family_c_e_act_class, va_soil_management_group, va_soil_productivity_group, bedrock_fracture_interval_class, boundary_distinctness, color_chroma, color_value, concen_redox_boundary, effervescence_class, concen_rmf_mottle_contrast, penetration_resistance, permeability_class, plasticity, pore_root_size, pvsf_distinctness, rupture_resist_block_dry, rupture_resist_block_moist, rupture_resist_plate, stickiness, structure_grade, structure_size, toughness_class, weathering, dmu_investigation_intensity, soil_taxonomy_edition, ia_subsoil_k, ia_subsoil_p, nj_farmland_assessment, Datetime Precision (NASIS 6 Metadata), sat_hyd_conductivity_class, soil_odor_intensity, texture_structure_category, crust_development_class, carbonate_dev_stage_cf, carbonate_dev_stage_fe, pore_quantity_class, abundance_class, canopy_cover_class, cryptogam_cover_class_legacy, cultivation_extent, current_year_precip, damage_degree, daubenmire_canopy_cover_class, decadent_plant_abundance, disturbance_impact, forest_stand_quality, ground_cover_class, ground_cover_extent, growing_season_rating, gully_rill_presence, invading_plants, pci_concentration_areas, pci_desirable_plants, pci_ground_cover_residue, pci_gully_erosion, pci_legume_pct_class, pci_plant_cover, pci_plant_diversity, pci_plant_vigor, pci_sheet_rill_erosion, pci_soil_compaction, pci_standing_dead_forage, pci_stream_shore_erosion, pci_use_uniformity, pci_wind_erosion, plant_density_class, reference_yield_rank, 
reproduction_abundance_class, rhi_annual_production, rhi_bare_ground, rhi_compaction_layer, rhi_erosion_resistance, rhi_functional_struct_groups, rhi_gullies, rhi_infiltration_runoff, rhi_invasive_plants, rhi_litter_amount, rhi_litter_movement, rhi_pedestals_terracettes, rhi_plant_mortality, rhi_reproductive_capability, rhi_rills, rhi_soil_surf_degradation, rhi_summary, rhi_water_flow_patterns, rhi_wind_scour_areas, salinity_class, sampling_intensity, seedling_abundance, sociability_class, soil_compaction, soil_crusting, soil_degradation, soil_surface_erosion, stocking_rate, suppression_degree, tree_condition, vigor_class, ak_ecological_site_status, ak_stratum_cover_class, ak_functional_group, ak_crown_class, ak_grazing_plant_group, rosgen_stream_subclass, ak_grazing_impact, observation_intensity, von_post_humification_scale, osd_text_kind, burn_intensity, crop_arrangement, dominant_vegetation, growth_status, harvest_skidding_method, type_of_burn, years_in, yrs_since_harvest, yrs_since_last_burn, burn_frequemcy, fertility_tests_done, dsp_site_type

See, for example, ponding frequency class: the ChoiceSequence is not the same as the ChoiceValue. Notably, the ordering includes the obsolete values. In this case the obsolete class "Common" has a value (5) that does not match its sequence position (4) in the set.
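
To illustrate why this matters for factor levels, here is a small base R sketch. The table is invented, apart from "Common" having value 5 and sequence position 4 as noted above. The `ChoiceLabel` and `ChoiceObsolete` column names, the other labels, and their values are assumptions.

```r
# Invented domain table; only the "common" row (value 5, sequence 4)
# reflects the case described above.
dom <- data.frame(
  ChoiceValue    = c(1, 2, 3, 4, 5),
  ChoiceSequence = c(1, 2, 3, 5, 4),
  ChoiceLabel    = c("none", "rare", "occasional", "frequent", "common"),
  ChoiceObsolete = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# coded values as they might arrive from a query
codes <- c(4, 1, 5, 2)

# decode by ChoiceValue, but order the levels by ChoiceSequence
ord <- dom[order(dom$ChoiceSequence), ]
f <- factor(
  dom$ChoiceLabel[match(codes, dom$ChoiceValue)],
  levels = ord$ChoiceLabel,
  ordered = TRUE
)

levels(f)
# "none" "rare" "occasional" "common" "frequent"
```

Ordering levels by `ChoiceValue` instead would put "common" last, so any ranked-domain encoding needs to pick one of the two columns deliberately and decide what to do with obsolete choices.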


MetadataDomainMaster.DisplayLabel

hydric_condition, nasis_site_office_type, farmland_classification, state_fips_code_alpha, texture_class, texture_modifier, unified_soil_classification, terms_used_in_lieu_of_texture, mapunit_hel_class, erosion_class, nh_important_forest_soil_group, logical_data_type_nasis, sort_type, site_index_curves, legend_suitability_for_use, mou_agency_responsible, ecological_site_mlra, mapunit_text_kind, legend_certification_status, dmu_certification_status, export_certification_status, hydric_soil_indicator, farmland_class_secondary, mapunit_type, cardinality_nasis, column_alignment, default_type, saf_cover_type, sort_direction, soil_type_conversion