This issue addresses poor performance when extracting from rasters with large block sizes, as in #45.
To process large datasets that cannot fit in memory, `exactextractr` uses the following approach:
1) Divide each feature into chunks covering no more than `max_cells_in_memory` pixels (10 million by default)
2) Read raster values for those pixels only
3) Compute coverage fractions and calculate stats for all layers in the raster
4) Repeat steps 1-3 if the polygon covers more than `max_cells_in_memory` pixels
5) Begin processing the next feature
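For context, a minimal sketch of the corresponding call, with hypothetical file paths and assuming a terra build with netCDF support and a version of `exactextractr` that accepts a `SpatRaster`:

```r
library(terra)
library(sf)
library(exactextractr)

tavg <- rast("nclimgrid_tavg.nc")        # 1385 layers, one per time step
counties <- st_read("fl_counties.shp")   # hypothetical path

# Each feature is processed in chunks of at most max_cells_in_memory pixels;
# lowering it reduces memory use, raising it reduces repeated reads.
means <- exact_extract(tavg, counties, "mean", max_cells_in_memory = 1e7)
```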
Overall, this performs quite well even though the same tiles/scanlines may be read multiple times for different features. However, this approach performs poorly for datasets that are written with especially large chunk sizes, such as NOAA NClimGrid.
Using the NClimGrid dataset, computing the mean temperature at 1385 time steps for Florida's 67 counties takes:
- 20 minutes for the file as distributed, with a chunk specification of (1 time / 596 lat / 1385 lon)
- 8 seconds for the same file rechunked using scanlines (1 time / 1 lat / 1385 lon) via `nccopy -d4 -s -c "time/1,lat/1,lon/1385" nclimgrid_tavg.nc nclimgrid_tavg_rechunk.nc`
- 5 seconds for the same file rechunked using 256x256 pixel tiles (1 time / 256 lat / 256 lon) via `nccopy -d4 -s -c "time/1,lat/256,lon/256" nclimgrid_tavg.nc nclimgrid_tavg_tiled.nc`
In this case, changing the chunk size improves performance by a factor of 240! But only the data author can control the chunk size.
Some ideas to mitigate this performance pitfall:
1) Identify files that use a suboptimal chunking scheme and notify the user (with a `warning` or a `message`?). This would require extending raster/terra with a function to report the block size, or obtaining this information independently through rgdal or some other means (example: https://gist.github.com/dbaston/357d5c30e5f4fe2ef32a1ca7653c29ea; see the first sketch after this list).
2) Check whether the portion of the raster covering the entire set of polygons fits in memory and, if so, read it eagerly with `terra::crop` (see the second sketch after this list). This is simple to implement (and is done by @mikejohnson51 in https://github.com/mikejohnson51/zonal). It vastly improves performance for poorly chunked files but may worsen it for optimally chunked files, and it does not allow processing datasets that are too large to fit in memory.
3) Warn the user if the data is too large to be eagerly fetched as in (2). This gives the user an opportunity to divide the data into smaller units (states, countries, etc.) or to increase `max_cells_in_memory`. It may mislead the user in the common case of optimally chunked files, where there is no need for the data to fit in memory.
4) Automatically subdivide the data, via some clustering scheme, into units that fit in memory. This completely solves the problem for suboptimally chunked files but results in unnecessary work for optimally chunked files.
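A sketch of idea 1, obtaining the block size independently of raster/terra by parsing `gdalinfo` output via `sf::gdal_utils()`; the subdataset path, the regex, and the one-million-pixel threshold are illustrative assumptions:

```r
library(sf)

# Pull the per-band block size out of gdalinfo's text output, which reports
# a line like "Block=1385x596" for each band. netCDF sources may need a
# subdataset path such as "NETCDF:nclimgrid_tavg.nc:tavg".
file_block_size <- function(path) {
  info <- gdal_utils("info", path, quiet = TRUE)
  hit <- regmatches(info, regexpr("Block=[0-9]+x[0-9]+", info))
  as.integer(strsplit(sub("Block=", "", hit), "x")[[1]])
}

bs <- file_block_size("NETCDF:nclimgrid_tavg.nc:tavg")
if (prod(bs) > 1e6) {
  message("Raster is stored in ", bs[1], "x", bs[2], " blocks; small reads ",
          "will be expensive. Consider rechunking, e.g. with nccopy.")
}
```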
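And a sketch of idea 2, essentially what zonal does (hypothetical paths; assumes the cropped extent is small enough that terra holds the result in memory):

```r
library(terra)
library(sf)
library(exactextractr)

tavg <- rast("nclimgrid_tavg.nc")
counties <- st_read("fl_counties.shp")

# One eager read of the combined working extent; each large chunk is then
# decompressed once, and extraction proceeds from memory rather than
# triggering many small reads against the file.
tavg_mem <- crop(tavg, ext(vect(counties)))
means <- exact_extract(tavg_mem, counties, "mean")
```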
The current plan is to:
- eagerly load the entire working area into memory if possible (no. 2 above), emitting a message if it does not fit (no. 3 above)
- if eager loading is not possible, inspect the file's chunk size and compare it to the size of the GDAL block cache, emitting a message if the block cache cannot hold at least one chunk from each layer (no. 1 above); a sketch of this check follows
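A sketch of that second check, with the block size obtained as above; `gdal_cache_mb` is a hypothetical stand-in for however the GDAL block cache size is actually queried, and 8-byte cells are assumed:

```r
# Hypothetical check: can the GDAL block cache hold one chunk per layer?
chunk_fits_in_cache <- function(block_dim, n_layers,
                                bytes_per_cell = 8, gdal_cache_mb = 256) {
  prod(block_dim) * bytes_per_cell * n_layers <= gdal_cache_mb * 1024^2
}

# NClimGrid as distributed: 1385x596 chunks across 1385 time steps
if (!chunk_fits_in_cache(c(1385, 596), n_layers = 1385)) {
  message("GDAL's block cache cannot hold one chunk from each layer, so ",
          "blocks will be evicted and re-read. Increase GDAL_CACHEMAX, ",
          "read fewer layers at once, or rechunk the file.")
}
```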