r-spatial / stars

Spatiotemporal Arrays, Raster and Vector Data Cubes
https://r-spatial.github.io/stars/
Apache License 2.0
560 stars 94 forks source link

use pkg exactextractr for zonal statistics (aggregate.stars) #289

Closed edzer closed 4 years ago

edzer commented 4 years ago

See https://github.com/isciences/exactextract#background

@dbaston this looks really cool!

edzer commented 4 years ago

Better, and faster. Example:

library(stars)
# Loading required package: abind
# Loading required package: sf
# Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x = read_stars(tif)
set.seed(1331)
p = st_sample(st_as_sfc(st_bbox(x)), 20) %>% st_combine()
v = st_voronoi(p) %>% st_collection_extract("POLYGON")
plot(x[,,,1], reset = FALSE)
plot(v, add = TRUE, border = 'red', lwd = 2)
a1 = aggregate(x, v, mean, exact = TRUE)
# Warning message:
# In .local(x, y, crop, ...) : Polygons transformed to raster CRS (EPSG:NA)
plot(a1[,,1], extent = x)
b1 = aggregate(x, v, sum, exact = TRUE)
# Warning message:
# In .local(x, y, crop, ...) : Polygons transformed to raster CRS (EPSG:NA)
plot(b1[,,1], extent = x)

a2 = aggregate(x, v, mean)
sum((a1-a2)[[1]])
# [1] 0.7340747
b2 = aggregate(x, v, sum)
sum((b1-b2)[[1]])
# [1] -0.0001235509
edzer commented 4 years ago

@appelmar

appelmar commented 4 years ago

exactextractr is awesome and really fast, I should check whether / how the C++ library could be interfaced with gdalcubes, too. Currently, I use a simple (not exact) rasterization approach with the advantage that it is easy to process large out-of-memory rasters chunk-by-chunk (I've tested this with Sentinel-2 images over North Rhine-Westphalia at 10m spatial resolution for a year and 55k polygons). By using GDAL only (gdal_rasterize in particular) directly from C++, this already gives a strong speedup over the raster::extract implementation but having an exact solution would be ideal.

dbaston commented 4 years ago

Cool! coverage_fraction is a big memory hog, since it generates one raster per feature. I wonder if I could generalize exact_extract a bit so that you could provide a callback to replace raster::getValuesBlock? That would be much, much faster and take advantage of exactextractr's ability to process arbitrarily large rasters chunk-by-chunk (max_cells_in_memory argument)

edzer commented 4 years ago

Ah, I hadn't given very large rasters much thought so far, but good point. As some stage, someone wants to do this exact for a polygon over Africa at a 10m resolution raster. For stars, I can see two approaches:

dbaston commented 4 years ago

What I'm suggesting is setting up a way for aggregate to invoke something like the following, under the hood:

a3 = exact_extract(as(x, 'Raster'), v, 'mean')

but without the conversion to a Raster. The only thing the raster package is providing here is a hook into GDALRasterIO, and I assume stars can provide an equivalent. This would let you process any number of polygons with a raster of any resolution without memory concerns, provided that FUN is something implemented by exactextractr.

edzer commented 4 years ago

Dan, could you point me to the code where exactextractr does that now with Raster objects? Note that with stars (and gdalcubes) we potentially have long time series of rasters ("a large stack"), and need to reuse the logic for computing weights.

dbaston commented 4 years ago

So, there are two paths through exactextractr:

a) pull out raster values for an entire polygon, return directly to the user or pass to an R function. Maximum flexibility for user but requires that all data fit in memory. b) pull out raster values for all (or a portion of) a polygon, process incrementally in C++. Have to use a predefined summary operation (min/max/mean/stdev etc.) but can process arbitrarily large data.

Raster values are pulled in (a) at https://github.com/isciences/exactextractr/blob/master/R/exact_extract.R#L259 and in (b) at https://github.com/isciences/exactextractr/blob/master/src/exact_extract.cpp#L103

LDalby commented 4 years ago

This is really cool!

I've been trying out some smaller examples before running it at 10 m resolution with several 100K polygons for the country (yes, I'm that guy :wink: )

I've been getting various errors when running on my own data and in the process of debugging that, I encountered what looks like a bug - when some values in the raster are NA all returned values become NA even if you set na.rm = TRUE:

library(stars)
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x = raster::raster(tif)
#> Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
#>  but +towgs84= values preserved
x[10000:15000] <- 1  # Couldn't figure out how to do this as stars?

x <- st_as_stars(x)

set.seed(1331)
p = st_sample(st_as_sfc(st_bbox(x)), 5) %>% st_combine()
v = st_voronoi(p) %>% st_collection_extract("POLYGON")
plot(x, reset = FALSE)
plot(v, add = TRUE, border = 'red', lwd = 2)

a1 = aggregate(x, v, mean, na.rm = TRUE, exact = TRUE)
#> Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
#>  but +towgs84= values preserved

a1[[1]]
#> [1] 58.58705 66.07556 80.85028 80.76239 92.04400

x = raster::raster(tif)
#> Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
#>  but +towgs84= values preserved
x[10000:15000] <- NA

x <- st_as_stars(x)

set.seed(1331)
p = st_sample(st_as_sfc(st_bbox(x)), 5) %>% st_combine()
v = st_voronoi(p) %>% st_collection_extract("POLYGON")
plot(x, reset = FALSE)
plot(v, add = TRUE, border = 'red', lwd = 2)

a2 = aggregate(x, v, mean, na.rm = TRUE, exact = TRUE)
#> Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
#>  but +towgs84= values preserved

a2[[1]]
#> [1] NA NA NA NA NA

Created on 2020-08-28 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.2 (2020-06-22) #> os Ubuntu 20.04.1 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_US:en #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Copenhagen #> date 2020-08-28 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> abind * 1.4-5 2016-07-21 [1] CRAN (R 4.0.2) #> assertthat 0.2.1 2019-03-21 [3] CRAN (R 4.0.0) #> backports 1.1.9 2020-08-24 [3] CRAN (R 4.0.2) #> callr 3.4.3 2020-03-28 [3] CRAN (R 4.0.0) #> class 7.3-17 2020-04-26 [4] CRAN (R 4.0.0) #> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2) #> cli 2.0.2 2020-02-28 [3] CRAN (R 4.0.0) #> codetools 0.2-16 2018-12-24 [4] CRAN (R 4.0.0) #> crayon 1.3.4 2017-09-16 [3] CRAN (R 4.0.0) #> curl 4.3 2019-12-02 [3] CRAN (R 4.0.0) #> DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2) #> desc 1.2.0 2018-05-01 [3] CRAN (R 4.0.0) #> devtools 2.3.1 2020-07-21 [3] CRAN (R 4.0.2) #> digest 0.6.25 2020-02-23 [3] CRAN (R 4.0.0) #> dplyr 1.0.2 2020-08-18 [1] CRAN (R 4.0.2) #> e1071 1.7-3 2019-11-26 [1] CRAN (R 4.0.2) #> ellipsis 0.3.1 2020-05-15 [3] CRAN (R 4.0.0) #> evaluate 0.14 2019-05-28 [3] CRAN (R 4.0.0) #> exactextractr 0.4.0 2020-06-28 [1] CRAN (R 4.0.2) #> fansi 0.4.1 2020-01-08 [3] CRAN (R 4.0.0) #> fs 1.5.0 2020-07-31 [3] CRAN (R 4.0.2) #> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2) #> glue 1.4.1 2020-05-13 [3] CRAN (R 4.0.0) #> highr 0.8 2019-03-20 [3] CRAN (R 4.0.0) #> htmltools 0.5.0 2020-06-16 [3] CRAN (R 4.0.1) #> httr 1.4.2 2020-07-20 [3] CRAN (R 4.0.2) #> KernSmooth 2.23-17 2020-04-26 [4] CRAN (R 4.0.0) #> knitr 1.29 2020-06-23 [3] CRAN (R 4.0.2) #> lattice 0.20-41 2020-04-02 [4] CRAN (R 4.0.0) #> lifecycle 0.2.0 2020-03-06 [3] CRAN (R 4.0.0) #> lwgeom 0.2-5 2020-06-12 [1] CRAN (R 4.0.2) #> magrittr 1.5 2014-11-22 [3] CRAN (R 4.0.0) #> memoise 1.1.0 2017-04-21 [3] CRAN (R 4.0.0) #> mime 0.9 2020-02-04 [3] CRAN (R 4.0.0) #> pillar 1.4.6 2020-07-10 [3] CRAN (R 4.0.2) #> pkgbuild 1.1.0 2020-07-13 [3] CRAN (R 4.0.2) #> pkgconfig 2.0.3 2019-09-22 [3] CRAN (R 4.0.0) #> pkgload 1.1.0 2020-05-29 [3] CRAN (R 4.0.0) #> prettyunits 1.1.1 2020-01-24 [3] CRAN (R 4.0.0) #> processx 3.4.3 2020-07-05 [3] CRAN (R 4.0.2) #> ps 1.3.4 2020-08-11 [3] CRAN (R 4.0.2) #> purrr 0.3.4 2020-04-17 [3] CRAN (R 4.0.0) #> R6 2.4.1 2019-11-12 [3] CRAN (R 4.0.0) #> raster 3.3-13 2020-07-17 [1] CRAN (R 4.0.2) #> Rcpp 1.0.5 2020-07-06 [3] CRAN (R 4.0.2) #> remotes 2.2.0 2020-07-21 [3] CRAN (R 4.0.2) #> rgdal 1.5-16 2020-08-07 [1] CRAN (R 4.0.2) #> rlang 0.4.7 2020-07-09 [3] CRAN (R 4.0.2) #> rmarkdown 2.3 2020-06-18 [3] CRAN (R 4.0.1) #> rprojroot 1.3-2 2018-01-03 [3] CRAN (R 4.0.0) #> sessioninfo 1.1.1 2018-11-05 [3] CRAN (R 4.0.0) #> sf * 0.9-5 2020-07-14 [1] CRAN (R 4.0.2) #> sp 1.4-2 2020-05-20 [1] CRAN (R 4.0.2) #> stars * 0.4-4 2020-08-27 [1] Github (r-spatial/stars@4bbaaa3) #> stringi 1.4.6 2020-02-17 [3] CRAN (R 4.0.0) #> stringr 1.4.0 2019-02-10 [3] CRAN (R 4.0.0) #> testthat 2.3.2 2020-03-02 [3] CRAN (R 4.0.0) #> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2) #> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2) #> usethis 1.6.1 2020-04-29 [3] CRAN (R 4.0.0) #> vctrs 0.3.2 2020-07-15 [3] CRAN (R 4.0.2) #> withr 2.2.0 2020-04-20 [3] CRAN (R 4.0.0) #> xfun 0.16 2020-07-24 [3] CRAN (R 4.0.2) #> xml2 1.3.2 2020-04-23 [3] CRAN (R 4.0.0) #> yaml 2.2.1 2020-02-01 [3] CRAN (R 4.0.0) #> #> [1] /home/au206907/R/x86_64-pc-linux-gnu-library/4.0 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```
edzer commented 4 years ago

Great example, thanks! This should fix it.

dbaston commented 3 years ago

What I'm suggesting is setting up a way for aggregate to invoke something like the following, under the hood:

a3 = exact_extract(as(x, 'Raster'), v, 'mean')

but without the conversion to a Raster. The only thing the raster package is providing here is a hook into GDALRasterIO, and I assume stars can provide an equivalent. This would let you process any number of polygons with a raster of any resolution without memory concerns, provided that FUN is something implemented by exactextractr.

@edzer , I am looking into supporting this on my end, but I cannot figure out how to get the values of a stars object as an array without using dplyr::pull (and I do not want to bring in the tidyverse as a dependency). In other worse, I'm looking for a non-tidyverse way of accomplishing this:

library(dplyr)
library(stars)

pop <- read_stars('gpw_v4_population_count_rev11_2020_30_sec.tif', proxy = TRUE)
aoi <- st_bbox(c(xmin=-74, xmax=-72, ymin=42, ymax=44), crs = st_crs(pop))

vals <- pull(st_as_stars(pop[aoi]), 1)

Is it possible?

edzer commented 3 years ago

Thanks for looking into this! Untried, but I think that should be:

vals <- st_as_stars(pop[aoi])[[1]]

after all a stars object is a list with arrays; [[1]] extracts the first list element.

przell commented 2 years ago

@edzer I was comparing the runtime of some aggregation/extraction methods for raster values into polygons. I discovered a strange behaviour when stars is loaded together with raster or terra when using aggregate(..., exact = TRUE)Some snippets according to the example above by @LDalby.

Only load stars - works

# libraries
library(stars)
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
#> Registered S3 methods overwritten by 'stars':
#>   method             from
#>   st_bbox.SpatRaster sf  
#>   st_crs.SpatRaster  sf
library(sf)
# library(raster)
# library(terra)

# stars object
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x_stars = stars::read_stars(tif)
x_stars = x_stars %>% dplyr::slice(band, 1)

# create polygon
set.seed(1331)
p = st_sample(st_as_sfc(st_bbox(x_stars)), 5) %>% st_combine()
v_sfc = st_voronoi(p) %>% st_collection_extract("POLYGON")

# aggregate
ext_sf_exact = aggregate(x_stars, v_sfc, mean, exact = TRUE)
ext_sf_exact[[1]]
#> [1] 66.59332 71.27686 80.85028 84.54792 92.04400

Created on 2021-10-19 by the reprex package (v2.0.1)

Load stars and raster or (terra) - breaks

# libraries
library(stars)
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
#> Registered S3 methods overwritten by 'stars':
#>   method             from
#>   st_bbox.SpatRaster sf  
#>   st_crs.SpatRaster  sf
library(sf)
library(raster)
#> Loading required package: sp
# library(terra)

# stars object
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x_stars = stars::read_stars(tif)
x_stars = x_stars %>% dplyr::slice(band, 1)

# create polygon
set.seed(1331)
p = st_sample(st_as_sfc(st_bbox(x_stars)), 5) %>% st_combine()
v_sfc = st_voronoi(p) %>% st_collection_extract("POLYGON")

# aggregate
ext_sf_exact = aggregate(x_stars, v_sfc, mean, exact = TRUE)
#> Error in aggregate.stars(x_stars, v_sfc, mean, exact = TRUE): for exact=TRUE, FUN should either be mean or sum
ext_sf_exact[[1]]
#> Error in eval(expr, envir, enclos): object 'ext_sf_exact' not found

Created on 2021-10-19 by the reprex package (v2.0.1)

Session Info

sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.28   withr_2.4.2     magrittr_2.0.1  reprex_2.0.1   
#>  [5] evaluate_0.14   highr_0.9       stringi_1.7.5   rlang_0.4.11   
#>  [9] cli_3.0.1       rstudioapi_0.13 fs_1.5.0        rmarkdown_2.11 
#> [13] tools_4.0.5     stringr_1.4.0   glue_1.4.2      xfun_0.26      
#> [17] yaml_2.2.1      fastmap_1.1.0   compiler_4.0.5  htmltools_0.5.2
#> [21] knitr_1.36

Created on 2021-10-19 by the reprex package (v2.0.1)

edzer commented 2 years ago

Ah, that's a nasty one - raster redefines mean as an S4 generic, just not inside stars' namespace:

> mean
function (x, ...) 
UseMethod("mean")
<bytecode: 0x562ad29829d0>
<environment: namespace:base>
> library(raster)
Loading required package: sp
> mean
standardGeneric for "mean" defined from package "base"

function (x, ...) 
standardGeneric("mean")
<environment: 0x562ad5aeefe8>
Methods may be defined for arguments: x
Use  showMethods(mean)  for currently available ones.
edzer commented 2 years ago

Should work now!

przell commented 2 years ago

Jup works! Thanks for the quick fix!