lcgodoy / smile

An R package to analyze spatially misaligned data.
https://lcgodoy.me/smile
GNU General Public License v3.0
8 stars 1 forks source link

Union polygons before creating point grid #10

Closed JsLth closed 9 months ago

JsLth commented 9 months ago

Hi @lcgodoy and thanks again for developing this package!

I recently got to try out the package to make point estimates from counties in Germany. However, I noticed that grid sampling is kind of weird when dealing with multipolygons. Particularly, my problem is with the following lines of sf_to_spm:

https://github.com/lcgodoy/smile/blob/da1194b137fc6a41e44f9e0aad6080796581393c/R/spm.R#L83-L84

The default behavior (in case of by_polygon = FALSE) is to extract the geometries of multipolygons and then cast them to "POLYGON". This creates some undesirable holes in the grid (see below). My first attempt was to re-sample without casting. This improved the regular point distribution quite a lot, but still left some irregularities ("No cast" in the plot below). Finally, I tried to union all polygons (ultimately, only the outer boundaries really matter, right?) and this left me with a perfectly regular grid over the study area.

This leads me to the questions, whether it would be reasonable to union all polygon geometries before sampling if by_polygon = FALSE. I think this is a common problem for administrative units (which tend to be fragmented to multipolygons sometimes) and might lead to some weird estimates around the problematic areas.

Here is a reprex:

library(sf)
library(smile)
library(tmap)

counties <- read_sf("https://sgx.geodatenzentrum.de/wfs_vg250?service=wfs&version=2.0.0&request=GetFeature&TYPENAMES=vg250_krs&outputFormat=application/json")
counties <- counties %>%
  group_by(ags) %>%
  summarise(gen = unique(gen)) %>%
  select(ags)
counties$var <- sample(seq(0, 1, 0.001), nrow(counties))

# Default approach: Cast multipolygons to polygons, then sample
spm <- smile::sf_to_spm(
  counties,
  n_pts = 2000,
  poly_ids = "ags",
  var_ids = "var"
)

# No cast approach: Cast all geometries to multipolygons, then cast back
# Note: The same results can be achieved by not casting at all, i.e.
# `st_geometry(sf_obj)` instead of `st_cast(st_geometry(sj_obj), "POLYGON")`
counties_poly <- counties %>%
  st_cast("MULTIPOLYGON") %>%
  st_cast("POLYGON")

spm_poly <- smile::sf_to_spm(
  counties_poly,
  n_pts = 2000,
  poly_ids = "ags",
  var_ids = "unempl"
)

# Union approach: Union all geometries before sampling because if
# by_polygon = FALSE, then everything inside the outer boundaries does not
# really seem to matter?
counties_union <- st_sf(st_union(counties))

spm_union <- smile::sf_to_spm(
  counties_union,
  n_pts = 2000,
  poly_ids = "ags",
  var_ids = "unempl"
)

grid_data <- bind_rows(
  "Default" = spm$grid,
  "No cast" = spm_poly$grid,
  "Union" = spm_union$grid, .id = "type"
)

tm_shape(grid_data) +
  tm_dots() +
  tm_facets("type")

d07e86aa-1f32-47c5-8559-64e843fdc079

lcgodoy commented 9 months ago

Hi @JsLth ,

Thank you for reporting this bug.

As a short-term solution, I fixed the problem with the casting you pointed out. Initially, the union case you pointed out was the default in previous package versions and, in my opinion, is more intuitive when setting by_polygon = FALSE.

About this

"This improved the regular point distribution quite a lot, but still left some irregularities”, I would not be bothered by what seems to be an irregularity. Those irregularities are regular grids within polygons with “weird” shapes. Moreover, I would rather use the No cast than the union grid because the former will estimate the numerical intervals in all the polygons with similar precision.

Below is a reprex of re-running your code with the updated version of the package (I just pushed to GitHub, so you will have to reinstall it from here. remotes::install_github(‘lcgodoy/smile’) will do it for you).

Meanwhile, I will work on making the union grid the default when setting by_polygon = FALSE.

Thanks again for reporting this bug.

library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(smile)
library(tmap)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

counties <- read_sf("https://sgx.geodatenzentrum.de/wfs_vg250?service=wfs&version=2.0.0&request=GetFeature&TYPENAMES=vg250_krs&outputFormat=application/json")

counties <- counties %>%
  group_by(ags) %>%
  summarise(gen = unique(gen)) %>%
  select(ags)

counties$var <- sample(seq(0, 1, 0.001), nrow(counties))

# Default approach: Cast multipolygons to polygons, then sample
spm <- smile::sf_to_spm(
  counties,
  n_pts = 2000,
  poly_ids = "ags",
  var_ids = "var"
)
#> Warning: st_centroid assumes attributes are constant over geometries

plot(spm$grid,
     pch = 15)


# No cast approach: Cast all geometries to multipolygons, then cast back
# Note: The same results can be achieved by not casting at all, i.e.
# `st_geometry(sf_obj)` instead of `st_cast(st_geometry(sj_obj), "POLYGON")`
counties_poly <- counties %>%
  st_cast("MULTIPOLYGON") %>%
  st_cast("POLYGON")
#> Warning in st_cast.sf(., "POLYGON"): repeating attributes for all
#> sub-geometries for which they may not be constant

spm_poly <- smile::sf_to_spm(
  counties_poly,
  n_pts = 2000,
  poly_ids = "ags",
  var_ids = "var"
)
#> Warning: st_centroid assumes attributes are constant over geometries

# Union approach: Union all geometries before sampling because if
# by_polygon = FALSE, then everything inside the outer boundaries does not
# really seem to matter?
counties_union <- st_sf(st_union(counties))

spm_union <- smile::sf_to_spm(
  counties_union,
  n_pts = 2000,
  poly_ids = "ags",
  var_ids = "var"
)

grid_data <- bind_rows(
  "Default" = spm$grid,
  "No cast" = spm_poly$grid,
  "Union" = spm_union$grid, .id = "type"
)

tm_shape(grid_data) +
  tm_dots() +
  tm_facets("type")


n_pts2 <- 2000 / nrow(counties)

spm_bp <- smile::sf_to_spm(
  counties,
  n_pts = n_pts2,
  by_polygon = TRUE,
  poly_ids = "ags",
  var_ids = "var"
)
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only

grid_data <- bind_rows(
  "Default" = spm$grid,
  "spm_by_poly" = spm$grid,
  "No cast" = spm_poly$grid,
  "Union" = spm_union$grid, .id = "type"
)

tm_shape(grid_data) +
  tm_dots() +
  tm_facets("type")

Created on 2024-01-26 with reprex v2.1.0

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.1 (2023-06-16) #> os macOS Sonoma 14.2.1 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2024-01-26 #> pandoc 3.1.6.1 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> abind 1.4-5 2016-07-21 [1] CRAN (R 4.3.0) #> base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.3.0) #> class 7.3-22 2023-05-03 [1] CRAN (R 4.3.1) #> classInt 0.4-10 2023-09-05 [1] CRAN (R 4.3.0) #> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1) #> codetools 0.2-19 2023-02-01 [1] CRAN (R 4.3.1) #> crosstalk 1.2.0 2021-11-04 [1] CRAN (R 4.3.0) #> curl 5.0.2 2023-08-14 [1] CRAN (R 4.3.0) #> DBI 1.2.0 2023-12-21 [1] CRAN (R 4.3.1) #> dichromat 2.0-0.1 2022-05-02 [1] CRAN (R 4.3.0) #> digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.1) #> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.1) #> e1071 1.7-14 2023-12-06 [1] CRAN (R 4.3.1) #> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0) #> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.1) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0) #> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.1) #> highr 0.10 2022-12-22 [1] CRAN (R 4.3.0) #> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0) #> htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0) #> KernSmooth 2.23-21 2023-05-03 [1] CRAN (R 4.3.1) #> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0) #> lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1) #> leafem 0.2.0 2022-04-16 [1] CRAN (R 4.3.0) #> leaflet 2.2.1 2023-11-13 [1] CRAN (R 4.3.1) #> leafsync 0.1.0 2019-03-05 [1] CRAN (R 4.3.0) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1) #> lwgeom 0.2-13 2023-05-22 [1] CRAN (R 4.3.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0) #> Matrix 1.6-1.1 2023-09-18 [1] CRAN (R 4.3.1) #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0) #> png 0.1-8 2022-11-29 [1] CRAN (R 4.3.0) #> proxy 0.4-27 2022-06-09 [1] CRAN (R 4.3.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0) #> raster 3.6-23 2023-07-04 [1] CRAN (R 4.3.0) #> RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.3.0) #> Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.1) #> reprex 2.1.0 2024-01-11 [1] CRAN (R 4.3.1) #> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.1) #> rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0) #> sf * 1.0-15 2023-12-18 [1] CRAN (R 4.3.1) #> smile * 1.0.4.2 2024-01-26 [1] local (/Users/lcgodoy/git-projects/smile) #> sp 2.1-1 2023-10-16 [1] CRAN (R 4.3.1) #> stars 0.6-4 2023-09-11 [1] CRAN (R 4.3.0) #> terra 1.7-46 2023-09-06 [1] CRAN (R 4.3.0) #> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0) #> tmap * 3.3-3 2022-03-02 [1] CRAN (R 4.3.0) #> tmaptools 3.1-1 2021-01-19 [1] CRAN (R 4.3.0) #> units 0.8-5 2023-11-28 [1] CRAN (R 4.3.1) #> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1) #> viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.3.0) #> withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.1) #> xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0) #> XML 3.99-0.14 2023-03-19 [1] CRAN (R 4.3.0) #> xml2 1.3.5 2023-07-06 [1] CRAN (R 4.3.0) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```