ncss-tech / soilDB

soilDB: Simplified Access to National Cooperative Soil Survey Databases
http://ncss-tech.github.io/soilDB/
GNU General Public License v3.0

NA should be interpreted as FALSE in .diagHzLongtoWide() #59

Closed dylanbeaudette closed 3 years ago

dylanbeaudette commented 6 years ago

It appears that .diagHzLongtoWide() assigns NA to all columns when a pedon has no diagnostic features. NA should probably be FALSE, since we interpret NULL records as missing in all other interpretations of diagnostic feature records.

Probably related to joining a limited set of peiid records to the full set in @site.

dylanbeaudette commented 6 years ago

Possible solution:

brownag commented 3 years ago

Grave digging a bit here while doing some fetchNASIS work for https://github.com/ncss-tech/soilDB/pull/149.

As Dylan pointed out, NA values are introduced by joining a null right-hand side (no diagnostic records for a particular peiid) to the pedon-level table, where all peiid are present.
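A minimal base R sketch of that join behavior, for illustration only. The table and column names (`site`, `diag_wide`, `mollic.epipedon`, `argillic.horizon`) are hypothetical stand-ins, and this is not the code inside .diagHzLongtoWide():

```r
# Hypothetical pedon-level table: every peiid is present
site <- data.frame(peiid = c(1, 2, 3))

# Hypothetical wide-format diagnostic flags: peiid 3 has no diagnostic records,
# so it does not appear on the right-hand side at all
diag_wide <- data.frame(
  peiid = c(1, 2),
  mollic.epipedon = c(TRUE, FALSE),
  argillic.horizon = c(FALSE, TRUE)
)

# Left join: keep all rows of `site`; unmatched rows get NA in the joined columns
res <- merge(site, diag_wide, by = "peiid", all.x = TRUE)

print(res)
```

In this sketch the row for peiid 3 carries NA in both flag columns, while peiid 1 and 2 carry explicit TRUE or FALSE values.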

I think I would shy away from filling NA with FALSE as proposed above because of how spotty the diagnostics table can be depending on data vintage/origin/purpose.

If folks agree with my interpretations of the following points, I think this issue can be closed.

  1. For "validation" purposes it may be valuable to know which pedons had NULL diagnostics (converted to NA) versus pedons where some diagnostics were populated, just not that type (FALSE).

    • My work on diagnostic horizon heuristic methods has been centered around the significant need for filling gaps in a standardized way (to answer more "global" questions about e.g. taxon criteria) and also performing basic crosschecks on manually classified/entered data. Saying diagnostics aren't present because none are populated is not ideal.
  2. Any calculated values are an "improvement" over nothing, but the interpretation of those calculated values can and should vary.

    • No manually-entered values means you have to "trust" the computer's interpretation and have no direct way to cross-check estimated presence/absence/boundaries without inferring from taxonomic relationships/correlations. That might be fine, and it might not, again depending on the data. Ergo, retaining a difference between NULL input vs. definitive FALSE seems wise.
  3. Filtering around NA is covered by methods like subset,SoilProfileCollection-method.
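To make the NA versus FALSE distinction in the points above concrete, here is a small sketch on a plain data.frame. It assumes a hypothetical `pedons` table with a logical `argillic.horizon` column, and it uses base R subsetting rather than the SoilProfileCollection method:

```r
# Hypothetical pedon-level flags: TRUE = recorded, FALSE = other diagnostics
# populated but not this one, NA = no diagnostic records at all
pedons <- data.frame(
  peiid = c(1, 2, 3, 4),
  argillic.horizon = c(TRUE, FALSE, NA, NA)
)

# Tabulate all three states; useNA keeps the NA count visible
print(table(pedons$argillic.horizon, useNA = "ifany"))

# Pedons with no diagnostic input (NULL converted to NA)
no_data <- pedons[is.na(pedons$argillic.horizon), ]

# Pedons with a definitive FALSE; %in% never returns NA, so NA rows are excluded
absent <- pedons[pedons$argillic.horizon %in% FALSE, ]

# Pedons with the feature present; base subset() treats NA conditions as FALSE
present <- subset(pedons, argillic.horizon)
```

Filling NA with FALSE up front would merge `no_data` into `absent`, and the information used for validation would be lost.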

As an aside, @smroecker had proposed a refactor of this function in https://github.com/ncss-tech/soilDB/issues/158. That might be something to strongly consider for all "extended" data sources in the near future, following the merge of #149.

jskovlin commented 3 years ago

I would agree with your comments, Andrew. I think it is good to be able to make the distinction between NULL input vs. definitive FALSE. We will need to retain that kind of information for future gap filling diagnostics and workflows.

dylanbeaudette commented 3 years ago

Works for me. Another line of reasoning: all pedons should have at least 1 diagnostic horizon / feature record, so a complete absence suggests a data population error.
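One way such a population check might look, as a rough sketch. The objects `site_ids` and `diag_long` and their contents are hypothetical, not output from any soilDB function:

```r
# Hypothetical IDs of all pedons in the collection
site_ids <- c(1, 2, 3, 4)

# Hypothetical long-format diagnostic records (one row per feature per pedon)
diag_long <- data.frame(
  peiid = c(1, 1, 2, 4),
  featkind = c("mollic epipedon", "argillic horizon",
               "ochric epipedon", "cambic horizon")
)

# Pedons with zero diagnostic records: candidates for a data population error
missing_ids <- setdiff(site_ids, unique(diag_long$peiid))

print(missing_ids)
```

Here only peiid 3 is flagged, since it is the only pedon with no rows in the diagnostic table.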