Hi @kevinwolz, thanks for reporting this issue. Or rather, I see two different issues.
Approach 1 (the one that fails) appears to be due to scientific notation formatting being applied to a small subset of the mukeys (only one that I can see, actually). Specifically, the problem mukey in this case is 600000, which gets converted to "6e+05". This approach works for most other counties/survey areas because they lack mukeys that can be represented more compactly in scientific notation.
To avoid this problem, use explicit conversion to integer with as.integer():
where1 <- as.integer(mukeys1) %>%
  paste0("'", ., "'") %>%
  toString() %>%
  paste0("mukey IN (", ., ")")
We have tried to handle this type of input sanitization in places that take mukey as an argument directly, but it can't be fixed in general for custom WHERE clauses. Just be aware that some mukey values will be abbreviated in scientific notation; you can set e.g. options(scipen = 99) to disable scientific notation altogether.
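A minimal illustration of the behavior (R's default scipen = 0 favors whichever representation is shorter):

format(600000)              # "6e+05" -- shorter than "600000", so scientific notation wins
format(as.integer(600000))  # "600000" -- integers are never abbreviated
options(scipen = 99)        # penalize scientific notation globally
format(600000)              # "600000"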
As far as why Approach 2 only returns a fraction of the mukeys that Approach 1 does: this is because they are querying different extents. In the case of the mukey.wcs() call, you are getting all mukeys in a rectangular bounding box around the AOI, with no masking to the area of interest. In contrast, the SDA_spatialQuery() approach returns only the mukeys that occur within the bounds of the AOI. Visually, you can compare by projecting aoi to "EPSG:5070" (the CRS of the mukey.wcs() result) and plotting them; you will see many more mukeys in the whole raster than fall within the polygon.
aoi <- sf::st_read("/vsizip/Bannock_ID.kml.zip") # Bannock County, ID
x <- soilDB::mukey.wcs(aoi = aoi, db = "gssurgo", res = 30, quiet = TRUE)
plot(x)                                                # full bounding-box grid
plot(terra::project(terra::vect(aoi), x), add = TRUE)  # county boundary overlay
@kevinwolz, if you are interested in spatial data for all of CONUS, I'd highly recommend using either or both of:
The mukey grids are kind of like a gSSURGO/gNATSGO-lite, as most of the tabular data can be accessed via SDA. Note that Raster Soil Survey (RSS) data are not yet in SDA.
I'm glad that you are interested in using the mukey WCS. More information on that here:
Thank you @brownag for explaining both cases! Very helpful. It would have taken me forever to figure out the scientific notation piece. I've added integer conversion to my script. And for the second part, I forgot that SDA_spatialQuery() doesn't use the bounding-box approach. Got it!
@dylanbeaudette thank you for your suggestions. I think I'm doing what you're suggesting, but maybe I'm missing something. Here's my general process (a sketch follows the list):
1) get county boundaries with tigris
2) get the mukey grid for each county with soilDB::mukey.wcs()
3) pull tabular data for those mukeys with soilDB::SDA_query()
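As a rough sketch of that process (the county and the columns queried here are illustrative, not from the original script):

library(tigris)
library(soilDB)
library(terra)

# 1) county boundary (Bannock County, ID as an example)
aoi <- counties(state = "ID", cb = TRUE)
aoi <- aoi[aoi$NAME == "Bannock", ]

# 2) 30m mukey grid for the county
x <- mukey.wcs(aoi = aoi, db = "gssurgo", res = 30)
mukeys <- unique(as.vector(values(x)))
mukeys <- as.integer(mukeys[!is.na(mukeys)])   # integer conversion avoids "6e+05"

# 3) tabular data for those mukeys
tab <- SDA_query(sprintf(
  "SELECT mukey, muname FROM mapunit WHERE mukey IN (%s)",
  toString(mukeys)
))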
Are you suggesting that I start with the full CONUS mukey grid that you linked to? I wish I could do that, but at the 30m resolution that I need, I think it just takes too much memory to work with the full CONUS grid. My analysis step is complex, memory-intensive, and ultimately requires saving data in EPSG:3857. So far it seems easiest for me to do the analysis at the county level (which I can then also parallelize nicely), and then merge all the counties together at the end.
Does this seem logical? Am I missing something?
@dylanbeaudette one thing that I think is not good with my method is that there will be large gaps in the CONUS data where SSURGO data do not exist but where only STATSGO is available. The SoilWeb Soil Properties map does not have these gaps, but perhaps it is not using SSURGO data, since the resolution there is much lower than 30m.
How would you recommend dealing with this in a way that balances the 30m requirement and the processing intensity of the analysis with the need for gap-free CONUS coverage? Perhaps I need to check whether soilDB::mukey.wcs(db = "gssurgo") has returned data for the full AOI, and, if not, call soilDB::mukey.wcs() again with a different db? Then I would somehow merge these together using terra, giving priority to the gSSURGO data?
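A minimal sketch of that merge, assuming the STATSGO2 grid is available via db = "statsgo" (it is coarser than 30m, so it is aligned to the gSSURGO grid first):

library(soilDB)
library(terra)

ssurgo  <- mukey.wcs(aoi = aoi, db = "gssurgo", res = 30)
statsgo <- mukey.wcs(aoi = aoi, db = "statsgo")

# align the coarser STATSGO grid to the gSSURGO grid;
# nearest-neighbor resampling preserves mukey values
statsgo30 <- resample(statsgo, ssurgo, method = "near")

# cover() keeps gSSURGO cells and fills only the NA cells from STATSGO
combined <- cover(ssurgo, statsgo30)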
@kevinwolz I'll reply in detail later today. I have examples and further discussion to share. Good luck!
Hi @kevinwolz, here are some follow-up questions and suggestions. First off, are you willing to use STATSGO data in place of missing SSURGO data? If so, then the gNATSGO mukey grid may solve some of your problems. The presence of RSS mukey values (and the lack of tabular RSS data on SDA) creates a couple of new problems. You could "roll your own" SSURGO/STATSGO hybrid using map unit key grids. The STATSGO2 mukey grid is available via mukey.wcs(), but the source data aren't official in any way; you can easily create a STATSGO2 mukey grid on your own, though. Thankfully, the RSS tabular data will be on SDA "soon".
The SoilWeb Soil Properties maps are gap-free because "holes" in SSURGO (coarsened to an 800m grid) are back-filled with STATSGO.
What is the final product that you have in mind? While this may not solve all of your problems, take a look at this experimental tiled approach to making CONUS maps. I still need to get in there and optimize a couple of steps, but it is pretty efficient.
I think the main questions in this issue (the failure caused by abbreviated mukeys, and the differing numbers of unique mukeys returned by the two methods) have been answered.
We are delving into different, important topics. I see two main discussions which could continue here:
1) how to efficiently process spatially extensive soil survey data (i.e. all of CONUS)
2) how to deal with "gaps" in the SSURGO dataset
For item 1, larger/local tabular data are reasonably well covered by soilDB methods, e.g. by specifying the dsn argument and pointing at "snapshot" databases stored locally rather than at web services. Corresponding methods for wrangling the spatial data lack dedicated functions in the package, and functions like mukey.wcs() are limited to relatively small grid sizes.
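For example (a hedged sketch; the local path is hypothetical, and this assumes a soilDB version whose get_SDA_* helpers accept dsn):

library(soilDB)

# against the Soil Data Access web service (the default)
clay_sda <- get_SDA_property(
  property = "claytotal_r",
  method   = "Dominant Component (Numeric)",
  mukeys   = 600000
)

# against a local "snapshot" SQLite database instead of the web service
clay_local <- get_SDA_property(
  property = "claytotal_r",
  method   = "Dominant Component (Numeric)",
  mukeys   = 600000,
  dsn      = "gssurgo_snapshot.sqlite"   # hypothetical local path
)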
For item 2, there are official methods for how SSURGO is backfilled with STATSGO2 and/or Raster Soil Survey to create continuous coverage (gNATSGO), but these backfilled areas aren't necessarily suitable replacements for detailed SSURGO. Currently, STATSGO data can be summarized like SSURGO, but they are often less complete, lack interpretations, etc.
Raster Soil Survey mukeys are usually quite detailed, but not yet available through any web service. Stitching together a queryable SQLite database that works for RSS mukeys can be done, but currently createSSURGO() (for example) operates only on the standard vector and tabular data associated with Web Soil Survey downloads.
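For reference, that standard workflow looks roughly like this (the survey area symbol and paths are examples):

library(soilDB)

# download standard SSURGO (vector + tabular) from Web Soil Survey
downloadSSURGO(areasymbols = "CA630", destdir = "ssurgo_downloads")

# build a local, queryable SQLite snapshot from the downloads
createSSURGO("ssurgo_snapshot.sqlite", exdir = "ssurgo_downloads")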
In some cases there would be value added to having soilDB functions process local spatial data, but it may be that certain types of processes, especially for large spatial data workflows, will be implemented using machinery outside the package.
I have a script that pulls mapunit data for every county in CONUS. There are just 3 counties in all of CONUS that fail. Here is one of them: Bannock_ID.kml.zip
aoi <- sf::st_read("Bannock_ID.kml") # Bannock County, ID
APPROACH 1 (fails)
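(Hypothetical reconstruction; the original code block was not preserved in this extract, but per the discussion above it was roughly:)

library(magrittr)

x <- soilDB::mukey.wcs(aoi = aoi, db = "gssurgo", res = 30)
mukeys1 <- unique(as.vector(terra::values(x)))
mukeys1 <- mukeys1[!is.na(mukeys1)]   # numeric doubles, e.g. 600000

where1 <- mukeys1 %>%
  paste0("'", ., "'") %>%             # large doubles may be abbreviated, e.g. "'6e+05'"
  toString() %>%
  paste0("mukey IN (", ., ")")

res1 <- soilDB::SDA_query(paste("SELECT mukey, muname FROM mapunit WHERE", where1))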
APPROACH 2 (works)
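(Hypothetical reconstruction, as above:)

res2 <- soilDB::SDA_spatialQuery(terra::vect(aoi), what = "mukey")
mukeys2 <- unique(res2$mukey)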
Why does Approach 1 fail, when Approach 2 works?
Approach 1 seems to add many extra mukeys. None are duplicates. Something about or in those extra mukeys must be causing the problem.
When I run the same approaches above for any other county (except the 2 others that act the same: Brown County, NE and Caribou County, ID), length(mukeys1) is always greater than length(mukeys2), BUT Approach 1 does not produce an error. For example, here is Champaign, IL: Champaign_IL.kml.zip
Why does Approach 1 produce so many extra mukeys compared to Approach 2 for all regions?