hypertidy / vapour

GDAL API package for R
https://hypertidy.github.io/vapour/
82 stars 9 forks source link

new sf/s2 killing readme #103

Closed mdsumner closed 3 years ago

mdsumner commented 3 years ago

just a dump of example code from readme, I'm taking it out

Example removed from readme 2021-06-25

The following concrete example illustrates the motivation for vapour, by through a timing benchmark for one standard operation: extracting feature geometries from a data set within a user-defined bounding box. The data set is the one used throughout the book Geocomputation in R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow, and can be obtained with the following code.

url <- file.path ("http://www.naturalearthdata.com/http//www.naturalearthdata.com",
                "download/10m/cultural/ne_10m_parks_and_protected_lands.zip")
download.file (url = url, destfile = "USA_parks.zip")
unzip (zipfile = "USA_parks.zip", exdir = "usa_parks")
fname <- "usa_parks/ne_10m_parks_and_protected_lands_area.shp"

That last fname is the file we're interested in, which contains polygons for all United States parks. We now construct a timing benchmark for three ways of extracting the data within a pre-defined bounding box of:

library (magrittr)
bb <- c (-120, 20, -100, 40)
names (bb) <- c ("xmin", "ymin", "xmax", "ymax")
bb_sf <- sf::st_bbox (bb, crs = sf::st_crs (4326)) %>%
    sf::st_as_sfc ()

First, we define a function to do the desired extraction using the sf package, comparing both st_crop and the sf::[ sub-selection operator:

f_sf1 <- function (fname)
{
    usa_parks <- sf::st_read (fname, quiet = TRUE)
    suppressMessages (suppressWarnings (
                    parks2 <- sf::st_crop (usa_parks, bb_sf)
                    ))
}
f_sf2 <- function (fname)
{
    usa_parks <- sf::st_read (fname, quiet = TRUE)
    suppressMessages (suppressWarnings (
                    parks2 <- usa_parks [bb_sf, ]
                    ))
}

Then three approaches using vapour, in each case extracting equivalent data to sf - that is, both geometries and attributes - yet simply leaving them separate here. The three approaches are:

  1. Read geometry as WKB, sub-select after reading, and convert to sfc lists;
  2. Read geometry as WKB pre-selected via an SQL statement, and convert to sfc lists; and
  3. Read geometry as json (text), pre-selected via SQL, and convert with the geojsonsf package
library (vapour)
f_va1 <- function (fname) # read all then sub-select
{
    ext <- do.call (rbind, vapour_read_extent (fname)) # bboxes of each feature
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    g <- vapour_read_geometry (fname) [indx] %>%
        sf::st_as_sfc ()
    a <- lapply (vapour_read_attributes (fname), function (i) i [indx])
}
f_va2 <- function (fname) # read selection only via SQL
{
    ext <- do.call (rbind, vapour_read_extent (fname))
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
    stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
                    " WHERE FID in (", n, ")")
    g <- vapour_read_geometry (fname, sql = stmt) %>%
        sf::st_as_sfc ()
    a <- vapour_read_attributes (fname, sql = stmt)
}
f_va3 <- function (fname) # convert json text via geojsonsf
{
    ext <- do.call (rbind, vapour_read_extent (fname)) # bboxes of each feature
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
    stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
                    " WHERE FID in (", n, ")")
    g <- vapour_read_geometry_text (fname, textformat = "json", sql = stmt) 
    g <- lapply (g, function (i) geojsonsf::geojson_sfc (i) [[1]]) %>%
        sf::st_sfc ()
    a <- vapour_read_attributes (fname, sql = stmt)
}

The benchmark timings - in particular the "relative" values - then illustrate the advantages of vapour:

rbenchmark::benchmark (
                       f_sf1 (fname),
                       f_sf2 (fname),
                       f_va1 (fname),
                       f_va2 (fname),
                       f_va3 (fname),
                       replications = 10)
junk <- file.remove ("USA_parks.zip")
unlink ("usa_parks", recursive = TRUE)

Reading geometries only, as opposed to the sf reading of all geometries and attributes, affords a speed increase of about 25%, while utilizing the SQL capabilities of ogr_sql offers an increase of around 75%.

mdsumner commented 3 years ago

gives this error

https://twitter.com/mdsumner/status/1408324361400901635/photo/1

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] s2_1.0.6             magrittr_2.0.1       dplyr_1.0.7          vapour_0.6.5         lazyraster_0.5.0     stars_0.5-3         
 [7] sf_1.0-0             abind_1.4-5          raadtools_0.6.0.9020 raster_3.4-13        sp_1.4-5            

loaded via a namespace (and not attached):
 [1] magic_1.5-9          pkgload_1.2.1        bit64_4.0.5          vroom_1.5.1          assertthat_0.2.1     yaml_2.2.1          
 [7] remotes_2.4.0        progress_1.2.2       sessioninfo_1.1.1    pillar_1.6.1         lattice_0.20-44      glue_1.4.2          
[13] digest_0.6.27        colorspace_2.0-1     htmltools_0.5.1.1    pkgconfig_2.0.3      devtools_2.4.2       purrr_0.3.4         
[19] scales_1.1.1         processx_3.5.2       tzdb_0.1.1           tibble_3.1.2         proxy_0.4-26         generics_0.1.0      
[25] usethis_2.0.1        ellipsis_0.3.2       cachem_1.0.5         withr_2.4.2          reproj_0.4.2         cli_2.5.0.9000      
[31] crayon_1.4.1         ps_1.6.0             memoise_2.0.0        maptools_1.1-1       evaluate_0.14        fs_1.5.0            
[37] ncdf4_1.17           fansi_0.5.0          lwgeom_0.2-6         foreign_0.8-81       class_7.3-19         pkgbuild_1.2.0      
[43] rsconnect_0.8.18     tools_4.1.0          prettyunits_1.1.1    hms_1.1.0            lifecycle_1.0.0      raadfiles_0.1.3.9022
[49] stringr_1.4.0        munsell_0.5.0        callr_3.7.0          compiler_4.1.0       e1071_1.7-7          RNetCDF_2.4-2       
[55] quadmesh_0.5.0       rlang_0.4.11         classInt_0.4-3       units_0.7-2          grid_4.1.0           rstudioapi_0.13     
[61] rmarkdown_2.9        proj4_1.0-10.1       wk_0.4.1             testthat_3.0.3       geometry_0.4.5       codetools_0.2-18    
[67] curl_4.3.2           DBI_1.1.1            R6_2.5.0             knitr_1.33           fastmap_1.1.0        rgeos_0.5-5         
[73] bit_4.0.4            utf8_1.2.1           rprojroot_2.0.2      desc_1.3.0           KernSmooth_2.23-20   stringi_1.6.2       
[79] parallel_4.1.0       Rcpp_1.0.6           vctrs_0.3.8          png_0.1-7            tidyselect_1.1.1     xfun_0.24  
mdsumner commented 3 years ago

you can st_make_valid the junk