isciences / exactextractr

R package for fast and accurate raster zonal statistics
https://isciences.gitlab.io/exactextractr/
272 stars 26 forks source link

add previous CPP_exact_extract2 #57

Closed kongdd closed 3 years ago

kongdd commented 3 years ago

This CPP_exact_extract2 is previous script of CPP_exact_extract in v0.5.0, which is valuable for multiple files extraction. Hope this function can be kept in the master branch. Thank you!

Those script has been tested in win10 and ubuntu 20.04. exactextractr works well on both of them.

dbaston commented 3 years ago

Can you provide an example usage where this performs better?

kongdd commented 3 years ago

In hydrological studies, huge shapefile with thousand features is quite often. The overlap operation is the most time-consuming procedure in extract. It is desired to separate the function of overlap and extract. I will give you an example few hours later.

kongdd commented 3 years ago

Sorry about the later response.

I have 21 tif files, corresponding to yearly ET estimation during 2000-2020. The shapefile is 10631 basins in the global scale.

(a) If separate the function of overlap and extract, the running time is 140 second; (b) If not, it is 190 second.

note: raster reading costs about 64 second second.

Because overlaping operation only performed once. This make (a) is 1.66 times faster than (b) (76 seconds vs. 126 seconds). Hence, I think it is necessary to keep previous version CPP_exact_extract function.

library(exactextractr)
library(extract2)
library(raster)
library(sf)

files <- dir("H:/global_WB/ET/diagnosed_PML/v017", "ERA5.*.tif", full.names = TRUE)
shp = read_sf("N:/Research/hydro/globalRunoff/INPUT/shp/globalRunoff_GSIM_10631basins.shp")

## 1. overlap and extract separately
blocks = overlap(files[1] %>% raster(), shp, return.id = FALSE) # overlap information for each feature
system.time(lst <- st_extract(files, blocks))
# 140.97s

## 2. exactextractr way
wkbs = sf::st_as_binary(sf::st_geometry(shp), EWKB=TRUE)
system.time({
    lst2 = llply(files, function(file) {
        r = brick(file) %>% readAll()
        exact_extract(r, shp, "mean")
    })
})
# 190.0 s

system.time({
    lst2 = llply(files, function(file) {
        r = brick(file) %>% readAll()
        # exact_extract(r, shp, "mean")
    }, .progress = "text")
})
# 64 s
dbaston commented 3 years ago

Why not make a RasterStack of all the input tif files?

kongdd commented 3 years ago

RasterStack is one solution. But not the best, because each tif file already have multiple bands. Besides, read all files into R will likely induce memory problem, especially for large files. If not reading into memory, so many features, the speed is very slow.

A small action will improve the performance substantially. It worth to keep previous version.

dbaston commented 3 years ago

How long does a RasterStack take without readAll ? You might also try a terra SpatRaster for better performance, at least with some file formats.

kongdd commented 3 years ago

Fast, 30s. But the result is incorrect. It looks only the result of the first year returned.

rs = files %>% map(rast) %>% do.call(c, .)
r = exact_extract(rs, shp, "mean")

image

dbaston commented 3 years ago

It looks like this happens because terra doesn't make the names unique, so the columns in the returned data frame overwrite each other. Try making the names unique yourself and give it a try.

kongdd commented 3 years ago

works. Using time reduced to 33s. But the result needs further tidy. But, I still think it worth to keep previous version, which is valuable for me. If no this function, st_extract and all of my self defined function zonal statistic function will unable to work.

Your small action, will give me or other users big convenience. Cheers!

dbaston commented 3 years ago

I'm happy to help troubleshoot performance problems and improve this package. But I am not seeing a performance problem that is addressed by this PR, and so I am hesitant to add code (especially with no documentation or tests) and accept its associated maintenance burden.

kongdd commented 3 years ago

Thank you. I will try hard to follow the step of exactextractr.

kongdd commented 3 years ago

Could you give some clue about how to use exact_extract correctly? The following script in a relative low computing efficiency, which used 97s.

rs <- files %>% map(rast) %>% do.call(c, .)
r <- exact_extract(rs, shp, "weighted_mean", weights = rs * 0 + 1) # only area weighted

In comparison, st_extract used 64s.

dbaston commented 3 years ago

weights = rs * 0 + 1 probably takes a while to execute, because it has to read in the entire input. If you don't want a weighted mean, why not exact_extract(rs, shp, 'mean') ?

kongdd commented 3 years ago

I need the area weighted mean. Is the following the correct?

rs <- files %>% map(rast) %>% do.call(c, .)
r <- exact_extract(rs, shp, "weighted_mean", weights = area(rs)) 
dbaston commented 3 years ago

terra::area doesn't do what I'd expect. Either way, exact_extract(rs, shp, 'weighted_mean', weights = 'area') should be faster, or at least less memory=intensive.

kongdd commented 3 years ago

Got it. Thank you!