isciences / exactextractr

R package for fast and accurate raster zonal statistics
https://isciences.gitlab.io/exactextractr/
272 stars 26 forks source link

Manual replication of exact extract #106

Closed sciabolazza closed 3 months ago

sciabolazza commented 3 months ago

Consider the following toy example. I create the shapefile of US states, and generate a raster from this shapefile. To each pixel of the raster I associate a unique value. Then, I exact extract the sum and the average values of the pixels falling within each state:

# load libraries
library(sf)
library(maps)
library(dplyr)
library(terra)
library(stars)
library(tidyterra)
library(exactextractr)

# map of the us
us <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
us <- st_transform(us, 4326)
# transform the map into a grid
us_grid <- st_make_grid(us, n = 1000) %>% 
  st_as_sf(.) %>% 
  mutate(value = 1:n()) 
# rasterize the grid
us_star <- st_rasterize(us_grid)
# raster: from star to spatraster
r <- rast(us_star)
# exact extract
res <- exact_extract(r, us, c("mean", "sum"))
res <- bind_cols(st_drop_geometry(us), res) %>% as_tibble()
res # Result

# A tibble: 49 × 3
   ID                      mean        sum
   <chr>                  <dbl>      <dbl>
 1 alabama              316548. 187755824 
 2 arizona              377844. 504420768 
 3 arkansas             403013. 254679920 
 4 california           500085. 954256576 
 5 colorado             571636. 745238464 
 6 connecticut          680399.  44530280 
 7 delaware             572282.  14361164 
 8 district of columbia 567871.    468068.
 9 florida              145673.  90250880 
10 georgia              310336. 210169088 
# ℹ 39 more rows
# ℹ Use `print(n = ...)` to see more rows

I now try to replicate the results without asking exact extract to generate the average and total values associated to a state, but only to indicate which pixels falls within a State.

In order to make things easier, I only consider Florida.

If I compare the average and total values of pixels falling within Florida calculated by me and by exact extract in the previous exercise, things are pretty much the same:

# select Florida
fl_values <- exact_extract(r, us %>% filter(ID == "florida"))
# values of cells in Florida 
fl_values <- as_tibble(fl_values[[1]])

#Sum value of cells in florida in fl_values
sum(fl_values[, 1] * # value of the pixel
      fl_values[, 2], # portion of the pixel coverage
    na.rm = T)
[1] 90250878
#Sum value of cells in florida in res
res %>% filter(ID  == "florida") %>% pull(sum)
[1] 90250880
#Average value of cells in florida in fl_values
sum(fl_values[, 1] * # value of the pixel
      fl_values[, 2], # portion of the pixel coverage
    na.rm = T)/sum(fl_values[, 2])
[1] 145666.5
#Average value of cells in florida in res
res %>% filter(ID  == "florida") %>% pull(mean)
[1] 145673.3

I now try to replicate the same exercise, but this time I identify by my own which cells fall within Florida, and then I calculate their average and total value.

This time results are similar when calculating the average, but differ significantly when calculating the total value.

# Repeat the same exercise manually with us_grid
fl <- us %>% filter(ID == "florida")
fl_grid <- st_join(us_grid, fl) %>% 
  mutate(area = st_area(.) %>% as.numeric()) %>% 
  filter(!is.na(ID))
fl_grid_cut <- st_intersection(fl_grid, fl %>% select(-ID)) %>% 
  mutate(area_cov = st_area(.) %>% as.numeric(),
         area_cov = area_cov/area) %>% 
  st_drop_geometry()

sum(fl_grid_cut$value * fl_grid_cut$area_cov)/sum(fl_grid_cut$area_cov)
[1] 145679.3
#Average value of cells in florida in res
res %>% filter(ID  == "florida") %>% pull(mean)
[1] 145673.3

sum(fl_grid_cut$value * fl_grid_cut$area_cov )
[1] 1392929296
#Sum value of cells in florida in res
res %>% filter(ID  == "florida") %>% pull(sum)
[1] 90250880

It could be that these differences are explained in the documentation of the package, but I failed to find them. Is it possible to provide more details on how exact extract creates zonal statistics?

dbaston commented 3 months ago

This time results are similar when calculating the average, but differ significantly when calculating the total value

us_grid has dimensions 1000x1000 but us_star has dimensions 393x166. There are not as many values in us_star, so the sum is lower.

Is it possible to provide more details on how exact extract creates zonal statistics?

They are documented at https://github.com/isciences/exactextract?tab=readme-ov-file#supported-statistics

dbaston commented 3 months ago

Closing this for now. Please feel free to reopen if things still don't work out.