ncss-tech / soilDB

soilDB: Simplified Access to National Cooperative Soil Survey Databases
http://ncss-tech.github.io/soilDB/
GNU General Public License v3.0
81 stars 20 forks source link

get_chorizon_from_SDA() returning duplicate horizons #348

Closed dylanbeaudette closed 3 months ago

dylanbeaudette commented 4 months ago

I just noticed that fetchSDA() will sometimes return duplicate horizons, probably due to a join with a child table having >1 record per genetic horizon. The following warning supports that:

2: In `hzidname<-`(`*tmp*`, value = "chkey") :
  horizon ID name (chkey) not unique. unique ID not changed.

I would guess that it is the join to chtexturegrp and/or chtexture tables. I'd suggest converting rows → single delimited list to account for multiple textures. It is wild to think that the affected component has both cl and sl textures in the same horizon.

Previous, probably unrelated issues:

Example.

library(aqp)
library(soilDB)

sql <- "
SELECT 
mukey, co.cokey, compname, comppct_r, majcompflag,
hzdept_r, hzdepb_r, hzname
FROM component AS co
JOIN chorizon AS ch ON co.cokey = ch.cokey
WHERE mukey = '467060' 
ORDER BY ch.cokey, comppct_r DESC, hzdept_r ASC;
"

# demonstrate that the source data are fine
SDA_query(sql)

# problem seems to occur no matter what combination of arguments is used
x <- fetchSDA(WHERE = "mukey = '467060'", duplicates = FALSE, childs = FALSE)

# see duplicates:
horizons(x)[, 1:10]

# note duplicate horizon boundaries
plotSPC(x, hz.depths = TRUE)

# digging deeper, it seems like the following function is making duplicate horizons:
get_chorizon_from_SDA(WHERE = "mu.mukey = '467060'", duplicates = FALSE, childs = FALSE)
brownag commented 4 months ago

It appears that the issue here is that after filtering chtexturegrp to just representative values, chtexture table gets right joined. In the case of these "duplicate" horizons they are stratified textures (i.e. SR-SL CL) that are representative so it is technically valid to have multiple entries, including SL and CL. There are certainly more contrasting stratified textures out there.

I think the intent is to return texcl (from chtexture) as well as texture (from chtexturegrp). Stratified textures are a case where it is expected to have a many:1 relationship between texcl and texture, even when only the RV texture group.

I suppose we could concatenate texcl, and it would have no issues at least for most horizons. This would run into issues if folks are expecting all texcl column output to have one of the standard choice list items for soil texture levels (aqp::SoilTextureLevels())... though I would argue the user should have to manually decide how they want to handle stratified textures in terms of deriving additional values.

As far as combination of arguments having no effect despite the name duplicates and childs are not expected to have any effect on horizon level results, or whether child table texture information is aggregated.

dylanbeaudette commented 4 months ago

OK, thanks for checking. I agree that the data in question and 1:many relationships are reasonable.

This may be more of an aqp issue, or a reporting issue for the parent function fetchSDA() which returns a SoilProfileCollection. We haven't fully resolved the overlapping horizon issue there and it might be a good idea to warn users when this happens. In this case, issuing a warning or message from fetchSDA() seems like the best place to put it.

Adding a new argument to flatten multiple textures could lead to even more complexity in an already complex function.

Open to ideas on how to avoid unexpected outcomes...

brownag commented 4 months ago

In general the reporting that the chkey is not unique is intended to highlight this for the user on aqp side--so we don't suppress those warnings on soilDB side even though they happen fairly often with various database sources and SPC returning methods. It is perhaps not as clear as it could be, though; it would be easy in fetchSDA() to check if the hzidname() of result SPC is not "chkey" and give a more informative message.

The aqp PR for collapseHz() would be a way for folks to address this type of overlap due to join issues in post-processing.

I think it would not be that difficult to concatenate texcl and lieutex from chtexture table for final chorizon result, and I think this should be the default behavior. It wouldn't be that big of a deal to add an argument to get the old behavior--if anyone thinks that should be retained.

dylanbeaudette commented 4 months ago

Thanks. I'll put some time in later today and tomorrow on a more informative message to the user. I'd also like to try out the texture-concatonating solution you've suggested. I suspect that most of the time that kind of "flattening" would be helpful.

Still--walking the line of preserving the original intent of the data AND supporting robust analysis / visualization is a challenge.

I do like the idea of having a general purpose "horizon flattening" method like collapseHz() in aqp.