ncss-tech / soilDB

soilDB: Simplified Access to National Cooperative Soil Survey Databases
http://ncss-tech.github.io/soilDB/
GNU General Public License v3.0
81 stars 20 forks source link

fetchNASIS phlabresults join issue #192

Closed hammerly closed 3 years ago

hammerly commented 3 years ago

When using fetchNASIS(from = "pedons", lab = TRUE) on pedons which contains > 1 row in the phlabresults table, it results in a join error, and fails to fetch any results. In most cases, this is probably desired, as it prevents other issues, and may identify data quality issues. A small percentage of pedons however, do contain lab results for specific depths within a horizon, in these situations, it might be useful to still fetch them and potentially analyze separately, or combine the results as a weighted average. Another edge case happens if the results from lab analysis for the same horizon are different than expected, the scientist may run the soil sample through the lab analysis again and obtain different results for the same sample and add an additional row to the phlabresults table for that horizon. In this case, it might be appropriate to take the newest result. This issue is similar to the issue identified in #120 for the phsample table. For examples, load upedonid: 1964IA113001, 1975IA045012. A possible remedy to this issue would be to implement something similar to the rmHzErrors=FALSE argument but specifically for the phlabresults data. This issue was briefly discussed with @smroecker.

brownag commented 3 years ago

Thanks for bringing this up @hammerly; some things were not working as they should. I made some changes in #195 that should address your immediate needs and work out-of-the-box for your two pedons. Now the original weighted averaging/dominant logic works as expected with the following notice for your example pedons:

NOTICE: multiple phiid values exist in the `phlabresults` table, computing weighted averages and dominant values based on horizon thickness

If we allow the lab data to be joined in at the beginning of fetchNASIS() (before SPC is created) then we can apply the existing horizon depth checking logic in the context of "errors" that might be introduced by joining (optional) phlabresults data. Basically allowing rmHzErrors=FALSE to work as expected for lab data, and allowing existing code for flattening to be called.

If you can please install off the https://github.com/ncss-tech/soilDB/tree/phlabresultswt branch and see if this resolves short term issues with fetchNASIS() erroring out.

remotes::install_github("ncss-tech/soilDB@phlabresultswt", dependencies=FALSE)

Now here are some things I am thinking about regarding more detailed aggregation options.

From your suggestions I see three proposed aggregation methods: "most recent", "weighted average" and "none"; and I support these ideas

  1. Timestamp column in soilDB:::.get_phlabresults_data_from_NASIS_db() result to allow for "most recent" aggregation.

    • As it stands the only time is recwlupdated... which is really not a good candidate to keep track of the most recent analysis IMO; it is prone to error, at least. This one will have to wait
  2. Technical replicates for variability in the analytical method and depth repeated measures for variability within horizon can be determined using unique combinations of sampledepthtop sampledepthbottom and sampleid.

    • We could do weighted averages on such samples in two steps. First, weighted average across sampleid within each stratum, then second do weighted average across strata to get a "JOIN safe" single record per horizon. This is more restrictive than the current logic in .get_phlabresults_data_from_NASIS_db() which currently will weighted-average your "Rerun ####" sampleswith your #### samples in 1975IA045012. I am cautious to encode specifics about the "reruns" as sampleid is essentially a free form text field and I don't know that all labs do this the same.
  3. The case where no aggregation is done would create depth logic errors by default since sampledepthtop and sampledepthbottom do not define the SPC logic. With the code to do weighted averaging of any duplicated records by "phiid" that is fixed. Taking no aggregation further to where new records with top and bottom depth sampledepthtop sampledepthbottom get swapped in for a subset of duplicated horizon depths is a bit more complicated but can be done and would make it easier to analyze the disaggregated contents of phlabresults in fetchNASIS() workflows.

hammerly commented 3 years ago

Thanks @brownag the branch resolves the issue with the fetch erroring out. Your proposed aggregation methods would be excellent features to add if we can figure out and agree on a way to implement them. For the "most recent" method, you are right, the recwlupdated can be problematic. For instances where data is present in the phhydrometeranalysis table, we could use testdate which would be more reliable than recwlupdated, but that still leaves the other methods without a date to compare. For the weighted averages, unless we can agree on and implement some standard way of capturing a particular sample has been re-analyzed or rerun, having a notice returned to the user may be the best we can do at this point. At least the notice should indicate the user should review the data and remove the row that should not be included in the fetch, if there are instances of reruns that are unreasonable to keep in the analysis. The instances of no aggregation seems a bit complex to implement as you explain above. In my own experience, I typically think of when a control section is sampled exclusively as the reason for wanting the no aggregation method. I'm unsure how often the phlabresults table is used for capturing that information.

brownag commented 3 years ago

Thanks @hammerly! I wasn't considering the child tables; good point for at least some of the analytes there would be testdate available. I will merge the #195 since it fixes the error; and will add some of comments in on the ongoing discussions about aggregation for future improvements. Feel free to weigh in on that if you want. https://github.com/ncss-tech/soilDB/discussions/160