Closed florisvdh closed 4 years ago
A vignette has been added to demonstrate package & data setup (using download_zenodo()
), sf
object handling & plotting, with read_soilmap()
as a case!
Thanks to @devosbr, for his comments on the read_soilmap()
documentation and his suggestions for variable names.
Further, note that today the soilmap_simple
data source has been published at Zenodo! It is a recommended data source for a reproducible & tidy analytical workflow.
Guess this PR is ready - if someone likes to test, this is more than welcome. Start with this:
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")
library(remotes)
install_github("inbo/n2khab@read_soilmap",
build_vignettes = TRUE,
upgrade = TRUE)
vignette("v022_example", package = "n2khab")
The environment variable R_REMOTES_NO_ERRORS_FROM_WARNINGS
is set here to solve potential errors from the remotes
package: it turns installation warnings into errors by default (as was seen for n2khab
here, and counterargued (but closed) in r-lib/remotes#403). Perhaps this should go into the README as well.
Let me know your findings!
Thanks @hansvancalster, I'll keep in mind your (fine) ideas to mention the use of knitr::purl()
and the factor level translation. The latter holds both for the simple and the raw data source. This would be an option in read_soilmap()
. I rather prefer to return extra columns (with the translated levels), rather than replacing factor levels, what do you think?
I'm curious whether other reviewers will also be as lucky with installation and with running the code :wink:
I rather prefer to return extra columns (with the translated levels), rather than replacing factor levels, what do you think?
Yes replacing the factor levels is not the best option. But would adding extra columns inflate the size of the objects in the R environment too much (don't remember but I think it took quite some memory already)? A translation table can always be joined after filtering the soil map to only what is needed for the analysis/visualisation.
Good to remind about object sizes @hansvancalster, I actually don't know how large would be the consequence of extra columns (in essence it is one integer column per factor). It's indeed a good alternative to return a list of sf
object + translation table. (That's how it goes for read_habitatstreams()
with the source_text = TRUE
argument.) I may just go for that.
Also the levels in the data source need to be checked against the list of @devosbr (here) in order to use the latter. I'm inclined to go for the Dutch labels only, for now, until English labels are made available as well.
Maybe just check if adding an extra column increases the size a lot. Might be not too bad: the geometry column will determine most of the object size.
the geometry column will determine most of the object size.
Sure, right!
It appears that we did the extra column approach in read_watersurfaces()
(argument extended = TRUE
), but that's a smaller dataset anyway.
I found that with filter(bsm_ge_coastalplain == TRUE)
also polygons outside the "coastal plain" are selected. It seems like all polygons where bsm_region
equals "Kunstmatige gronden" are selected as well, spread across Flanders. This is the case for both soilmap
and soilmap_simple
.
sm_simple_co <- sm_simple %>%
filter(bsm_ge_coastalplain == TRUE)
sm_simple_co %>%
mapview(map.types = "OpenStreetMap")
Is that correct?
I found that with filter(bsm_ge_coastalplain == TRUE) also polygons outside the "coastal plain" are selected
Well spotted @DriesAdriaens :+1: , this clearly isn't correct and we'll have to solve this. Will look into it.
Thanks @DriesAdriaens for your review. I'll look into the specific comments.
Just wondered if "coastal plain" is the correct translation of "polders"
Well, it originally was "polders" until I changed it to "coastal plain" :wink: in 8e08faa. It think this was based on terminology of @devosbr in his document in January, maybe he has also told me (don't remember); anyhow he recently reviewed and updated the documentation of read_soilmap()
here so I expect it to be OK.
I wasn't aware that the dunes (bsm_region == "Duinen"
) were included in bms_ge_coastalplain
too. In that sense, "Polders" would indeed be a name that is too narrow to use for the area where the standardize_coastalplain
procedure is applied. Maybe "coastal area" or "coastal region" could be a more appropriate -and less misleading- than "coastal plain". Just my humble opinion.
Type_class == "Zeepolders"
from the raw data source has been used as condition to set bsm_ge_coastalplain
as TRUE
.
Probably I did a misinterpretation here; the explanation of this field is: Type van de bodemclassificatie (‘zeepolders’ of ‘rest van Vlaanderen’)
, which may refer to type categories rather than areas.
Did this exploration (expand by clicking the arrow): (source code: soilmap_zeepolders1.R.zip)
I'll discuss this, and the 'coastal plain' term with @devosbr and come back.
Anyhow, I find the value 'Zeepolders' for soiltypes occurring everywhere in Flanders quite confusing. I prefer not to keep this Type_class
information from the raw datasource in the result in R but rather choose a proper approach for setting bsm_ge_coastalplain
- which may be the last one, using bsm_ge_region
.
Whether the spatial occurrence of Type_class == "Zeepolders"
should be considered a problem in the raw data source at DOV or not, is another matter.
A nice addition would be to have the option to add
_explan
versions of the selection of variables insoilmap_simple
, similar to those of the rawsoilmap
data source.
Using _explan
variables to use as labels is the way to go indeed, it was all there already :satisfied:! No need for matching with separate explanatory lists as I was thinking yesterday. The number of unique values match in the raw data source, between c("bsm_mo_substr", bsm_mo_tex", bsm_mo_drain", bsm_mo_prof", bsm_mo_parentmat", bsm_mo_profvar")
and their _explan
counterparts (by all that I refer to the original variables behind it). That is why the factor levels were aligned when coding read_soilmap(use_processed = FALSE)
.
I'll check what's the effect on object size as discussed with @hansvancalster , to conclude how to best (optionally) return the _explan
info in the result of read_soilmap(use_processed = TRUE)
.
As soilmap_simple
will probably need an update (because of bsm_ge_coastalplain
), the best way to go will be to store the _explan
levels in an extra table within the GeoPackage, in order to easily extract it with the function.
@DriesAdriaens thanks for your most valuable review comments! :pray:
@DriesAdriaens follow-up of coastalplain affairs, after fruitful discussions today with @devosbr and further analysis:
terminology 'coastalplain': will be better defined in the function's documentation, but name will be kept. 'coastal region' would be interpreted more narrowly, 'polders' is too narrow as well; it is coastal estuaries + dunes + polders but that cannot be put short in variable's names. 'Coastal plain', just like 'Zeepolders', should just remind the soil expert of what is actually meant, which we can describe.
correcting the bsm_ge_coastalplain
(logical) variable:
bsm_ge_region
s can be counted in (TRUE
)bsm_region == "Kunstmatige gronden" & type_class == "Zeepolders"
(same as is.na(bsm_ge_region) & type_class == "Zeepolders"
) has several subcases:
soil_translation_coastalplain.tsv
(used by the standardize_coastalplain
argument of read_soilmap()
) can be counted in (TRUE
).c("OG2", "OO3", "OE1", "OO1", "OU2", "OZ", "OO", "OU1", "OG1", "OO2", "OU3", "OO4")
OZ
polygons), they are confined to the coastalplain area (the few aberrations are to be considered mistakes in the raw data source which we won't solve here).OL
, without such translation but with all its (4) polygons confined to the coastalplain area, can be counted in as well (TRUE
).OW
and OH
- fall completely outside the coastalplain area: they're not counted in (FALSE
)c("OB", "OE", "OT", "ON", "OC", "OA")
- occur both within and outside the coastalplain area, hence they will receive bsm_ge_coastalplain <- NA
That should result in a more acceptable labelling of polygons that originally had type_class == "Zeepolders"
in the raw data source.
We could go further by using a spatial definition - but that does not seem worth the trouble (it would not differ much from the above). Spatial definition would largely boil down to ecodistricts 'Kustduinendistrict' plus 'Kustpoldersdistrict', plus an upstream part of the Yser river valley... (see illustrations)
Rather, we wanted to return sensible information that is available in the soilmap
, so an update for 'Zeepolders' was needed. What cannot be correctly derived from the soilmap
, should be corrected in the raw data source in the first place.
Will generate an updated read_soilmap()
and soilmap_simple
, beginning of next week.
Further note concerning above 'coastal plain' terminology:
Summarizing reflections about the coastalplain soiltypes:
note about overlap: while the raw data source links all O*
soiltypes to 'Zeepolders' - which may be the historical origin of these classes indeed - Van Ranst & Sys (2000) separately list the occurring O*
soiltypes within and outside the coastalplain area, which clearly shows the overlap between the used soilltypes.
does it make sense to have a type_class
(type classification) variable in the raw data source 'soilmap'? Apart from the historical reasons, probably not. If one's aim is to see which codes are used where (cf. Van Ranst & Sys 2000), (s)he should do so by joining the spatial contour of the coastalplain area. Apart from the O*
soiltypes, it is always possible to find back the geomorphological regions, and codes used therein (other than O*
), by filtering for non-missing values of bsm_ge_region
.
does it make sense to return this type_class
(type classification) variable in R, when reading the raw data source 'soilmap'? Here I take a conservative view: the purpose of read_soilmap(use_processed = FALSE)
is to return all potentially useful information from the raw data source in a streamlined way. Hence I prefer to keep the information as previously returned for as long as the type_class
variable occurs in the raw soilmap
data source. However, the variable name bsm_ge_coastalplain
is better replaced by bsm_ge_typology
, with definition:
bsm_ge_typology
: Logical. Does the soiltype code follow the geomorphological typology?
does it make sense to return a variable bsm_ge_typology
when reading soilmap_simple
into R? Clearly not. Rather than trying a best possible approach to predict whether a polygon is inside or outside the coastalplain area - like in previous comment in an attempt to approach the original intent of bsm_ge_coastalplain
- it seems more useful to add a variable bsm_converted
, which is TRUE
for all soiltypes with non-missing bsm_ge_region
plus c("OG2", "OO3", "OE1", "OO1", "OU2", "OZ", "OO", "OU1", "OG1", "OO2", "OU3", "OO4")
. I.e. a variable that says whether morphogenetic texture and drainage category are the result of a conversion from these soiltypes. This conversion to non-missing texture and drainage categories is largely confined to the 'coastal plain' area, while whithin the 'coastal plain' area still other O*
codes occur. Without this information, one cannot easily trace back which soiltype records are original and which were converted post-hoc (conversions are an estimate, not observations). Hence it makes sense to make this distinction in soilmap_simple
for analytical purposes. This variable will be defined as:
bsm_converted
: Logical. Were morphogenetic texture and drainage variables (bsm_mo_tex
andbsm_mo_drain
) derived from a conversion table? ValueTRUE
is largely confined to the 'coastal plain' area. Only returned ifstandardize_coastalplain = TRUE
.
Action taken: the metadata of current version soilmap_simple_v1
at Zenodo (https://doi.org/10.5281/zenodo.3732904) has been corrected for the bsm_ge_coastalplain
variable:
bsm_ge_coastalplain
: boolean. Did the original soil type code follow the geomorphological typology? It isTRUE
for all polygons inside the coastal plain area, and also for a few soil type codes (starting with letterO
) with a wider distribution across Flanders (the latter belong to the soil types for which the typology could not be converted into a morphogenetic one).
Note to self: a bit code & output
Observations on object size:
library(n2khab)
library(tidyverse)
sm <-
read_soilmap(use_processed = FALSE, standardize_coastalplain = TRUE)
> object.size(sm) %>% print(units = "MiB")
394.8 MiB
> sm %>% select(-matches(".+_mo_.+_explan")) %>% object.size() %>% print(units = "MiB")
388.6 MiB
> sms <- read_soilmap()
> object.size(sms) %>% print(units = "MiB")
363.7 MiB
So just staying with columns for _explan
variables is best.
@hansvancalster considering some remaining points:
Add code in the vignette to show how to extract the R-code in case someone wants to try it without having to copy-paste
A slight alternative has been made for archival purposes (a04cb8c), but reverted immediately (0c242c9). Given that most chunks have eval=FALSE
, this resulted in consistent outcommenting of most R code with #>
, which is not convenient to run. Therefore, considered this addition as unpractical to the user. Eventually I added some code to copy the Rmd file, in e5bb79c.
when downloading the simple soil map the progress % and bytes were not showing correctly, but the download was done without problems
Was an upstream bug in the curl
package (see inbo/inborutils#79), should be fixed in its master (you may want to test! - see jeroen/curl#219) but it is not yet released.
the code with ggplotly did not render in the preview of my RStudio (1.1.463)
Dropped ggplotly in 99b5592.
@DriesAdriaens @hansvancalster Your suggestion regarding:
A nice addition would be to have the option to add _explan versions of the selection of variables in soilmap_simple, similar to those of the raw soilmap data source.
... has been implemented, both in the new version of the soilmap_simple
data source (https://doi.org/10.5281/zenodo.3747496) and by the explan
argument of the function (see documentation).
read_soilmap(explan = TRUE)
adds these as extra _explan
variables (resulting in variables effectively identical to those obtained when reading the raw data source). Adding as columns hardly inflates object size in R because they are factors.read_soilmap(explan = TRUE)
has been included in the vignette.explan = FALSE
. Consequently, the soilmap_simple
object in R remains most 'simple' by default.Adding read_soilmap()
and vignette to release candidate 0.2.0
now.
Note that more commits have to be added before merging; this PR is to be able to have discussion.