Potentially make exact_extract faster?

tmieno2 commented 4 years ago

Dear dbaston,

Thanks for this wonderful package. For my research, I have been exploring the fastest way to extract raster values for polygons. While I was experimenting with velox and exactextractr packages, I found that exactextractr's extraction speed suffers considerably as the number of polygons increases (with the spatial extent of the polygons and raster resolutions held fixed) while velox does not. On the other hand, velox suffers when the number of raster cells increase while exactextratr does not.

The code below demonstrates the first point. The raster data holds 1,630,950 cells. The number of polygons to which the raster values are extracted are varied: 1) 79, 2) 1622, and 3) 6187 polygons.

The velox approach finishes the extraction job in less than a second.
The exactextractr approach takes 1.12, 6.706, and 23.655 seconds, respectively.

The amount of data to move around are very similar across different number of polygons as the resolution and spatial extent of the raster data stay the same across the experiments and because the polygons are not overlapping. I am guessing many many calls of getValuesBlock() (and other functions to process by chunk) is hurting the performance of exactextractr. If so, is there any way you can modify the way you process the data by chunk in a way that reduces the number of chunks to avoid this problem? I do not completley understand how exact_extract() works, but just wanted to ask if it is possible because that would make the function even better (I think).

library(velox)
library(raster)
library(sf)
library(exactextractr)

#===================================
# Prepare datasets
#===================================
#--- US state boundaries ---#
US_state <- st_as_sf(map(database = "state", plot = FALSE, fill = TRUE)) 

#--- raster data ---#
raster_US <- st_rasterize(US_state) %>% 
  as("Raster") %>% 
  disaggregate(fact = 5)  

#--- create a brick ---#
raster_US_brick <- brick(raster_US, raster_US, raster_US, raster_US, raster_US)

#--- create polygons for which we extract data from the brick ---#
polys_100 <- st_make_grid(US_state, n = c(10, 10))
polys_2500 <- st_make_grid(US_state, n = c(50, 50))
polys_10000 <- st_make_grid(US_state, n = c(100, 100))

#===================================
# Comparison: changing the number of polygons 
#===================================
#--------------------------
# polys_100 (79 polygons)
#--------------------------
length(polys_100)

#--- exact_extract ---#
tic()
a <- exact_extract(raster_US_brick, polys_100)
toc()

object.size(a) %>% format(units = "Mb")

#--- velox ---#
tic()
a <- velox(raster_US_brick)$extract(polys_100)
toc()

#--------------------------
# polys_2500 (1622 polygons)
#--------------------------
length(polys_2500)

#--- exact_extract ---#
tic()
a <- exact_extract(raster_US_brick, polys_2500, FUN = mean)
toc()

object.size(a) %>% format(units = "Mb")

#--- velox ---#
tic()
a <- velox(raster_US_brick)$extract(polys_2500)
toc()

#--------------------------
# polys_10000 (6187 polygons)
#--------------------------
length(polys_10000)

#--- exact_extract ---#
tic()
a <- exact_extract(raster_US_brick, polys_10000)
toc()

object.size(a) %>% format(units = "Mb")

#--- velox ---#
tic()
a <- velox(raster_US_brick)$extract(polys_10000)
toc()

dbaston commented 4 years ago

Thanks for putting together this benchmark. Here are some thoughts:

An important difference between the exactextractr and velox packages is the decision of exactextractr to work with inputs from the raster package. This has the benefit of avoiding the need to convert from what is arguably canonical representation of raster data in R into another data type to perform operations, and it allows the code to be written such that many operations can be performed on arbitrarily large data with limited memory. The velox raster type loads everything into memory, which often provides better performance but limits the scale of data that can be worked with.

Another difference is that exactextractr is interested in computing the percentage of each pixel that is covered by a polygon, while velox is approximating the result by testing the pixel centroid only.

Because of the above, it is likely that velox will offer better performance in many cases if the user is willing to convert to a different data structure, accept the memory limitations, and the approximate result. I'd also add that if performance is a primary concern, it might be worth working outside R (the command-line version of this tool is substantially faster.)

Your benchmark is interesting because it uses RasterStack inputs and many small simple (5-vertex!) polygons. Since I don't often work with either of these, the benchmark highlights some performance problems I hadn't encountered myself. I was able to make some substantial improvements, but I expect you'll continue to see better performance with velox for these very simple polygons.

I'd be curious to see you repeat your testing with GitHub versions of both raster and exactextractr. The following changes have recently been made:

avoiding expensive calls to raster::extent (f4718c92a6ca; benefits all types of inputs)
avoiding RasterStack -> RasterLayer conversion (741a3e1dd3; benefits RasterStack/RasterBrick inputs)
reducing calls to raster::getValuesBlock (ec74efccd87fb67e94; benefits RasterStack/RasterBrick inputs))
optimizing raster::getValuesBlock itself (https://github.com/rspatial/raster/commit/58aea83cd78d0c6882 and https://github.com/rspatial/raster/commit/58aea83cd78d0c6; benefits all inputs)

A benchmark I like to use is the README example:

# Extract mean precipitation for each month
# Takes 60s using CRAN exactextractr and raster; 16s using master
system.time(x <- exact_extract(prec, brazil, 'mean'))

# Extract mean precipitation for one month
# Takes 6s using CRAN exactextractr and raster; 3s using master
system.time(x12 <- exact_extract(prec[[12]], brazil, 'mean'))

tmieno2 commented 4 years ago

First of all, thanks for sharing your thoughts.

And, oh my goodness. Here are the numbers:

1: only exactextractr updated to 0.4.0

polys_100: 0.749 seconds (1.12) polys_2500: 6.281 seconds (6.706) polys_10000: 21.763 seconds (23.655)

The numbers in parentheses are the ones I reported above (before updating to 0.4.0)

both exactextractr and raster updated to the latest version

polys_100: 0.517 seconds (0.703) polys_2500: 1.51 seconds (0.446) polys_10000: 3.371 seconds (0.448)

The numbers in parentheses are by velox.

So, dramatic improvement indeed when both are updated to the latest version. This is really good to know. Thanks!

Haha, the examples are really toy examples. I was sort of taking the number of polygons to an extreme. But, 3000 polygons are definitely possible as we have more than 3000 US counties and we (economists) use counties as units of analysis often.

And, yes, I know exactextractr are doing more than velox and comparisons are not really fair. I like exactextractr better because of coverage-fraction and glad to see its performance getting much closer to velox.

Btw, even before I update to the latest version of raster and exactextractr, exact_extract() was faster than velox in many circumstances. For example, I experimented with the performance of the two with a raster file with 200 million cells. exact_extract() was considerably faster than velox when the number of polygons are small, and is still a little faster than velox even when the number of polygons is 8000. Which makes sense because of the way exactextractr and velox handle data as you described above. velox is really good to a certain point until the memory becomes the bottleneck.

dbaston commented 4 years ago

Glad you're seeing a good improvement!

I don't think the number of polygons in your benchmark is extreme -- I'm often extracting with a set of 60,000 watershed boundaries -- but the low complexity of the polygons (5 vertices) is. That was valuable for highlighting some overhead in getValuesBlock, but the conclusions might not carry over to polygons with more typical complexity, such as the county boundaries that you mentioned.

tmieno2 commented 4 years ago

I don't think the number of polygons in your benchmark is extreme -- I'm often extracting with a set of 60,000 watershed boundaries -- but the low complexity of the polygons (5 vertices) is. That was valuable for highlighting some overhead in getValuesBlock, but the conclusions might not carry over to polygons with more typical complexity, such as the county boundaries that you mentioned.

Oh, I see. Got it.

60,000 polygons! No economists will ever need to do that. Haha. Even 8000 seemed extreme to me.

In any case, thanks a lot!

bryanparthum commented 2 years ago

60,000 polygons! No economists will ever need to do that. Haha. Even 8000 seemed extreme to me.

@tmieno2 , hello from DC, now further away from you than UIUC. Unfortunately, I happened across this thread when benchmarking exact_extract with velox to recover values for 84,000 huc12 boundaries (CONUS). Your comment did not age well :D

tmieno2 commented 2 years ago

Hey @bryanparthum. Yes, indeed. I said economists will never need to do that and it has been only two years since I made the comment....Haha.

isciences / exactextractr

Potentially make exact_extract faster? #34