ncss-tech / aqp

Algorithms for Quantitative Pedology
http://ncss-tech.github.io/aqp/
54 stars 15 forks source link

horizon level reporting of depth logic errors in checkHzDepthLogic #197

Closed dylanbeaudette closed 3 years ago

dylanbeaudette commented 3 years ago

Per recent discussion with @smroecker and @brownag, checkHzDepthLogic would benefit from a new argument that enables horizon-level reporting of errors. This commit contains a nice example of how this kind of information might be reported.

Given the recent optimizations in checkHzDepthLogic, this might take a little bit of time / effort to fully integrate. @brownag since you are the most familiar with this function and optimizations, what do you think that it would take to add this functionality?

brownag commented 3 years ago

My personal preference would be to refactor the function hzDepthTests to optionally return horizon-level checks (i.e. no all()/any()) and provide info using same standard formats and names-- a data.frame with logical columns and connotative test names. With idname() and hzidname() as needed. After doing the refactor, we could use the modern {data.table} code of checkHzDepthLogic more or less "as-is"--to do profile level or horizon level summary. The default will be aggregated as site-level parameter as this is backwards-compatible, efficient, and adequate for identification of presence/absence all sorts of data problems.

The output would emulate the kind of output we we use for spc_in_sync() "integrity" method and the "validity" methods checkHzDepthLogic/hzDepthTests.

If we need to add new test types this is possible, but I am not sure it is necessary based on what I see in above commit. I have considered in the past treating ALL NA depth as its own (5th) type of error. But it is "problematic" as NA can affect other logical tests, so they may need to be "managed" to get "accurate" or decisive results about error origin. Possibly more effort than its worth -- unless applying specific fixes (some other function).

Most implementations using this logic-checking code out there that I am aware of should be robust to addition of a new test "name" -- but it depends on how and how much users are making use of the specific test results, so I would only add new if absolutely necessary.

In @smroecker's approach the same basic logical comparisons are being done as checkHzDepthLogic but more intermediate results are retained for the user. From my view there is no reason to duplicate these logical expressions AGAIN -- they were abstracted out for a reason, and can be further abstracted based on the horizon level needs and this suggestion. There is absolutely nothing novel or groundbreaking about the logic itself -- but keeping it in one place is the whole point here.

Including more intermediate results increases the memory footprint -- see benchmarks showing this phenomenon on the related issue #174 -- but is still useful regardless of efficiency on large sets. But to be clear -- efficiency on large sets is/has been a major design consideration for [upgrades to] the function so far.

I think the commit is a fine demonstration of needs for some better localization of specific subtypes of {aqp} depthLogic error. We have an excellent conceptual addition that will help be at the backbone of new QC functions (main soilDB discussion: https://github.com/ncss-tech/soilDB/discussions/160). I would encourage folks who are working on things like this on their own to take part in the relevant discussions and or make proposals to tackle the numerous long-standing issues that are related. These things do not happen in silos. Dylan and I had significant in-person discussions when checkHzDepthLogic replaced test_hz_logic.


It is not ideal that I named the abstracted function hzDepthTests early last year.

Despite the name, what was covered historically by test_hz_logic and checkHzDepthLogic are profile or site-level aggregate logic checks. For the express purpose of creating a SITE level index of profile "validity". Which is most commonly the use case in the past -- with the goal of preparing SPC for input to logic-sensitive functions like slice/slab/glom/profile_compare etc. Identifying the nature or origin of a logic problem to "fix" it is significantly more complex IN MY OPINION than simply identifying them, so I am extremely leery of naive checks that fix things automatically with no user review. But there is definitely some low hanging fruit here that can be better handled to give the user more actionable information from aqp::checkHzDepthLogic

Perhaps a more appropriate name for the current workhorse function of checkHzDepthLogic would be profileDepthTests, which in turn applies the simple logical comparisons from the above commit at the site, and hzDepthTests would be horizon-level. However, the old name is what it is, and I don't think there should be two hard-to-distinguish-purpose-of functions.

The horizon-level output is valuable and a new argument to the existing validity functions could easily alter the tabular output from checkHzDepthLogic to return site or horizon level index. This could then be used by "fixer" functions -- which would have several options for parameterizing and documenting user modifications of data.

brownag commented 3 years ago

200 was merged for basic changes to checkHzDepthLogic to allow logical output for tests on a horizon basis. I think this covers the needs outlined in this issue, though as I stated above my preference was to refactor hzDepthTests.

I originally created hzDepthTests because checkHzDepthLogic did not completely replace the interface to test_hz_logic but was a wrapper around it. This was a function that took two input vectors -- top and bottom depth, implied to be from a single profile. I recently extended it to work with a single input of data.frame type containing the two columns, to support data.table .SD-based syntax.

The implementation from #200 works, but is probably not as efficient as what the refactored version of this method could be.

Also overlapOrGap tests are still run/reported on single-horizon input data -- this is both a waste of memory and possibly misleading.

dylanbeaudette commented 3 years ago

Thanks. We can further optimize down the line.