Closed: tmieno2 closed this issue 4 years ago
Thanks for putting together this benchmark. Here are some thoughts:
An important difference between the exactextractr and velox packages is the decision of exactextractr to work with inputs from the raster package. This has the benefit of avoiding the need to convert from what is arguably the canonical representation of raster data in R into another data type to perform operations, and it allows the code to be written so that many operations can be performed on arbitrarily large data with limited memory. The velox raster type loads everything into memory, which often provides better performance but limits the scale of data that can be worked with.
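As a minimal sketch of that difference (file names are hypothetical; exact_extract reads values from the raster object in chunks, while velox materializes the full grid in memory up front):

```r
library(raster)
library(sf)
library(exactextractr)

# exactextractr: operates directly on a raster-package object; values
# are read block-by-block, so the full grid never has to fit in memory
r <- raster("large_grid.tif")          # hypothetical file
polys <- st_read("boundaries.shp")     # hypothetical file
means <- exact_extract(r, polys, 'mean')

# velox: the entire raster is loaded into memory when the object is built
library(velox)
vx <- velox("large_grid.tif")
means_vx <- vx$extract(polys, fun = mean)
```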
Another difference is that exactextractr computes the percentage of each pixel that is covered by a polygon, while velox approximates the result by testing only the pixel centroid.
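A small self-contained sketch of what the coverage fractions look like (toy data; when called without a summary function, exact_extract returns per-polygon data frames of cell values and covered fractions):

```r
library(raster)
library(sf)
library(exactextractr)

# toy 3x3 raster over a 3x3 unit extent
r <- raster(matrix(1:9, nrow = 3), xmn = 0, xmx = 3, ymn = 0, ymx = 3)

# a square covering the center cell fully and the edge cells partially
poly <- st_sf(geometry = st_sfc(st_polygon(list(rbind(
  c(0.5, 0.5), c(2.5, 0.5), c(2.5, 2.5), c(0.5, 2.5), c(0.5, 0.5))))))

# with no summary function, each polygon yields a data frame with a
# 'value' column and a 'coverage_fraction' column; partially covered
# edge cells get fractions below 1
exact_extract(r, poly)[[1]]
```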
Because of the above, it is likely that velox will offer better performance in many cases if the user is willing to convert to a different data structure and accept the memory limitations and the approximate result. I'd also add that if performance is a primary concern, it might be worth working outside R (the command-line version of this tool is substantially faster).
Your benchmark is interesting because it uses RasterStack inputs and many small, simple (5-vertex!) polygons. Since I don't often work with either of these, the benchmark highlights some performance problems I hadn't encountered myself. I was able to make some substantial improvements, but I expect you'll continue to see better performance with velox for these very simple polygons.
I'd be curious to see you repeat your testing with the GitHub versions of both raster and exactextractr. The following changes have recently been made:
- raster::extent (f4718c92a6ca; benefits all types of inputs)
- RasterStack -> RasterLayer conversion (741a3e1dd3; benefits RasterStack/RasterBrick inputs)
- raster::getValuesBlock (ec74efccd87fb67e94; benefits RasterStack/RasterBrick inputs)
- raster::getValuesBlock itself (https://github.com/rspatial/raster/commit/58aea83cd78d0c6882 and https://github.com/rspatial/raster/commit/58aea83cd78d0c6; benefits all inputs)

A benchmark I like to use is the README example:
# Extract mean precipitation for each month
# Takes 60s using CRAN exactextractr and raster; 16s using master
system.time(x <- exact_extract(prec, brazil, 'mean'))
# Extract mean precipitation for one month
# Takes 6s using CRAN exactextractr and raster; 3s using master
system.time(x12 <- exact_extract(prec[[12]], brazil, 'mean'))
First of all, thanks for sharing your thoughts.
And, oh my goodness. Here are the numbers:
1: only exactextractr updated to 0.4.0

polys_100: 0.749 seconds (1.12)
polys_2500: 6.281 seconds (6.706)
polys_10000: 21.763 seconds (23.655)

The numbers in parentheses are the ones I reported above (before updating to 0.4.0).

2: both raster and exactextractr updated to the latest versions

polys_100: 0.517 seconds (0.703)
polys_2500: 1.51 seconds (0.446)
polys_10000: 3.371 seconds (0.448)

The numbers in parentheses are the velox timings.
So, dramatic improvement indeed when both are updated to the latest version. This is really good to know. Thanks!
Haha, the examples are really toy examples; I was sort of taking the number of polygons to an extreme. But 3,000 polygons are definitely realistic: there are more than 3,000 US counties, and we (economists) often use counties as the unit of analysis.
And, yes, I know exactextractr is doing more than velox, so the comparison is not really fair. I like exactextractr better because of the coverage fraction, and I'm glad to see its performance getting much closer to velox's.
Btw, even before I updated to the latest versions of raster and exactextractr, exact_extract() was faster than velox in many circumstances. For example, I compared the performance of the two on a raster file with 200 million cells. exact_extract() was considerably faster than velox when the number of polygons was small, and it was still a little faster even with 8,000 polygons. This makes sense given the way exactextractr and velox handle data, as you described above: velox is really fast up to the point where memory becomes the bottleneck.
Glad you're seeing a good improvement!
I don't think the number of polygons in your benchmark is extreme -- I'm often extracting with a set of 60,000 watershed boundaries -- but the low complexity of the polygons (5 vertices) is. That was valuable for highlighting some overhead in getValuesBlock, but the conclusions might not carry over to polygons with more typical complexity, such as the county boundaries that you mentioned.
Oh, I see. Got it.
60,000 polygons! No economists will ever need to do that. Haha. Even 8000 seemed extreme to me.
In any case, thanks a lot!
@tmieno2, hello from DC, now further away from you than UIUC. Unfortunately, I happened across this thread when benchmarking exact_extract against velox to recover values for 84,000 huc12 boundaries (CONUS). Your comment did not age well :D
Hey @bryanparthum. Yes, indeed. I said economists would never need to do that, and it has been only two years since I made that comment... Haha.
Dear dbaston,
Thanks for this wonderful package. For my research, I have been exploring the fastest way to extract raster values for polygons. While experimenting with the velox and exactextractr packages, I found that exactextractr's extraction speed suffers considerably as the number of polygons increases (with the spatial extent of the polygons and the raster resolution held fixed), while velox's does not. On the other hand, velox suffers when the number of raster cells increases, while exactextractr does not.
The code below demonstrates the first point. The raster data holds 1,630,950 cells. The number of polygons to which the raster values are extracted is varied: 1) 79, 2) 1622, and 3) 6187 polygons.
The amount of data to move around is very similar across the different numbers of polygons, as the resolution and spatial extent of the raster data stay the same across the experiments and the polygons do not overlap. I am guessing that the many, many calls to getValuesBlock() (and other functions that process by chunk) are hurting the performance of exactextractr. If so, is there any way you can modify how the data is processed by chunk so that the number of chunks is reduced, avoiding this problem? I do not completely understand how exact_extract() works, but I just wanted to ask if it is possible, because that would make the function even better (I think).
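As a concrete illustration of the kind of knob I have in mind (file names are hypothetical; max_cells_in_memory is the argument that recent exactextractr versions expose for exactly this trade-off):

```r
library(raster)
library(sf)
library(exactextractr)

r <- raster("big_raster.tif")       # hypothetical large raster
polys <- st_read("counties.shp")    # hypothetical polygons

# max_cells_in_memory caps how many raster cells exact_extract loads at
# once; raising it means fewer, larger chunks -- and thus fewer
# getValuesBlock() calls -- at the cost of more memory
means <- exact_extract(r, polys, 'mean', max_cells_in_memory = 1e8)
```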