ncss-tech / SoilTaxonomy

A System of Soil Classification for Making and Interpreting Soil Surveys
https://ncss-tech.github.io/SoilTaxonomy/
GNU General Public License v3.0

`parse_family()` with taxa above family #45

Closed brownag closed 8 months ago

brownag commented 10 months ago

Consider the following example using parse_family() on a taxonomic class field of mixed levels.

Note where taxclname is: 1) a suborder-level name, 2) a subgroup-level name, or 3) a great group-level name with some family-level classes specified.

library(SoilTaxonomy)
suppressPackageStartupMessages(library(soilDB))

x <- data.frame(
  taxonname = c("Alberti", "Aquents", "Lithic Xeric Torriorthents", "Stagy Family", "Haplodurids"),
  taxonkind = c("series", "taxon above family", "taxon above family", "family", "taxon above family"),
  taxclname = c(
    "Clayey, smectitic, thermic, shallow Vertic Rhodoxeralfs",
    "Aquents",
    "Lithic Xeric Torriorthents",
    "Coarse-loamy, mixed, mesic Duric Haploxerolls",
    "Mixed, superactive, thermic Haplodurids"
  )
)
parse_family(x$taxclname)
#>                                                    family
#> 1 Clayey, smectitic, thermic, shallow Vertic Rhodoxeralfs
#> 2                                                 Aquents
#> 3                              Lithic Xeric Torriorthents
#> 4           Coarse-loamy, mixed, mesic Duric Haploxerolls
#> 5                 Mixed, superactive, thermic Haplodurids
#>                     subgroup subgroup_code                        class_string
#> 1        vertic rhodoxeralfs          JDEB Clayey, smectitic, thermic, shallow
#> 2                       <NA>          <NA>                                <NA>
#> 3 lithic xeric torriorthents          LECB                                    
#> 4         duric haploxerolls          IFFZ          Coarse-loamy, mixed, mesic
#> 5                       <NA>          <NA>                                <NA>
#>   classes_split  taxpartsize taxpartsizemod taxminalogy taxceactcl taxreaction
#> 1  Clayey, ....       clayey             NA   smectitic         NA          NA
#> 2            NA         <NA>             NA        <NA>         NA          NA
#> 3                       <NA>             NA        <NA>         NA          NA
#> 4  Coarse-l.... coarse-loamy             NA       mixed         NA          NA
#> 5            NA         <NA>             NA        <NA>         NA          NA
#>   taxtempcl taxfamhahatmatcl taxfamother                  taxsubgrp
#> 1   thermic               NA     shallow        Vertic Rhodoxeralfs
#> 2      <NA>               NA        <NA>                       <NA>
#> 3      <NA>               NA        <NA> Lithic Xeric Torriorthents
#> 4     mesic               NA        <NA>         Duric Haploxerolls
#> 5      <NA>               NA        <NA>                       <NA>
#>     taxgrtgroup taxsuborder  taxorder
#> 1  Rhodoxeralfs     Xeralfs  Alfisols
#> 2          <NA>        <NA>      <NA>
#> 3 Torriorthents    Orthents  Entisols
#> 4  Haploxerolls     Xerolls Mollisols
#> 5          <NA>        <NA>      <NA>

Should this be handled differently? Currently, the derived NASIS-like columns (e.g. taxsuborder) come from decomposing a valid (current taxonomy) subgroup-level name, so they are NA for any taxon above family that is not at the subgroup level.

Questions:

  • Is it "valid" to apply family-level classes to taxa above subgroup?
  • If there is a detectable taxon above subgroup should it be split out?
  • Should family level classes also be returned (even if not "valid")?
  • How often are family level taxa combined with taxa above subgroup in SSURGO?

In practice SSURGO components that are taxa above subgroup usually are constrained to one or more family classes e.g. PSC, temperature regime, which can sometimes be cleanly expressed using something like the family level class format. I suppose these can be interpreted as specifications about groups of related families... but it may be that it is confusing without some sort of a wildcard character, and splits of taxa above subgroup should be based on phases (outside scope of package).

brownag commented 8 months ago

This has been addressed in #46

Should this be handled differently? Currently, the derived NASIS-like columns (e.g. taxsuborder) come from decomposing a valid (current taxonomy) subgroup-level name, so they are NA for any taxon above family that is not at the subgroup level.

Now taxa at any level are returned. Two additional columns are added, "taxclname" and "code"; these hold the input taxonomic class and the letter code of the lowest level present in it (order, suborder, great group or subgroup).

  • Is it "valid" to apply family-level classes to taxa above subgroup?

Yes, it is common for higher taxonomic concepts to have specific family-level classes associated with them, for instance the temperature regime or particle size class.

  • If there is a detectable taxon above subgroup should it be split out?

Yes, and to avoid confusion the subgroup_code and lowest-level (not necessarily subgroup) code are both returned. For taxa above subgroup the value is NA for any levels that are not defined in the input.
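To make the "lowest-level code" idea concrete, here is a minimal base R sketch. It rests on an assumption that is not stated in this thread: that the letter codes are hierarchical, with one letter each for order, suborder and great group, and four or more letters for a subgroup. The helper `code_level()` is hypothetical and is not a SoilTaxonomy function; the prefixes of "LECB" are used only as stand-ins for higher-level codes.

```r
# Hypothetical helper (not part of SoilTaxonomy): infer the taxonomic level
# of a letter code from its length, ASSUMING one letter per level and
# four or more letters for a subgroup.
code_level <- function(code) {
  n <- nchar(as.character(code))
  n[!is.na(n) & n == 0L] <- NA_integer_  # treat empty strings as missing
  c("order", "suborder", "great group", "subgroup")[pmin(n, 4L)]
}

# "LECB" is the subgroup code reported above for Lithic Xeric Torriorthents;
# its prefixes illustrate the higher levels under the stated assumption.
codes <- c("L", "LE", "LEC", "LECB", NA)
data.frame(code = codes, level = code_level(codes))
```

Under that assumption, a value in "code" that is shorter than four letters would signal that the input was a taxon above subgroup, which is when subgroup_code is NA.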

  • Should family level classes also be returned (even if not "valid")?

We don't currently have the logic to determine which family level classes are required for particular taxa. The ability to validate whether classes used are appropriate for particular subgroup or higher level taxa could be within the purview of a new function validate_family() or similar.

  • How often are family-level classes combined with taxa above subgroup in SSURGO?

Some quick queries indicate that more often than not a taxon above family is associated with one or more family-level classes. About 71% of taxon above family components (26,621 of 37,312) have taxpartsize and/or taxtempregime populated.

suppressPackageStartupMessages(library(soilDB))

SDA_query("SELECT COUNT(DISTINCT cokey) FROM component 
           WHERE compkind = 'taxon above family'")
#> single result set, returning a data.frame
#>      V1
#> 1 37312

SDA_query("SELECT COUNT(DISTINCT cokey) FROM component 
           WHERE compkind = 'taxon above family'
           AND taxpartsize IS NOT NULL")
#> single result set, returning a data.frame
#>      V1
#> 1 20240

SDA_query("SELECT COUNT(DISTINCT cokey) FROM component 
           WHERE compkind = 'taxon above family'
           AND taxtempregime IS NOT NULL")
#> single result set, returning a data.frame
#>      V1
#> 1 23840

SDA_query("SELECT COUNT(DISTINCT cokey) FROM component 
           WHERE compkind = 'taxon above family'
           AND (taxpartsize IS NOT NULL OR taxtempregime IS NOT NULL)")
#> single result set, returning a data.frame
#>      V1
#> 1 26621
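
The proportions behind that summary can be worked out from the four counts above. This is a small sketch that hard-codes the numbers returned at the time of the queries (they will drift as SSURGO is refreshed); the variable names are illustrative only. The count of components with both fields populated is not queried directly, so it is derived here by inclusion-exclusion.

```r
# Counts copied from the SDA_query() results above
n_total  <- 37312  # all 'taxon above family' components
n_psc    <- 20240  # taxpartsize IS NOT NULL
n_temp   <- 23840  # taxtempregime IS NOT NULL
n_either <- 26621  # taxpartsize OR taxtempregime IS NOT NULL

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
n_both <- n_psc + n_temp - n_either

pct <- round(100 * c(taxpartsize   = n_psc,
                     taxtempregime = n_temp,
                     either        = n_either,
                     both          = n_both) / n_total, 1)
pct
#>   taxpartsize taxtempregime        either          both 
#>          54.2          63.9          71.3          46.8
```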