isciences / exactextractr

R package for fast and accurate raster zonal statistics
https://isciences.gitlab.io/exactextractr/

Evaluation error: std::bad_alloc #93

Closed p-schaefer closed 1 year ago

p-schaefer commented 1 year ago

When trying to extract raster values from a pretty large raster (260 GB uncompressed GeoTIFF) using polygons, I get the following error:

Error in CPP_exact_extract(x, x_ext, x_res, x_orig, weights, wkb, default_value,  : 
  Evaluation error: std::bad_alloc.

I assume this has something to do with memory limitations (though I am running this on a server with 256GB of RAM). It does run with terra::extract() but is slow. Is there anything that can be done to resolve this?

dbaston commented 1 year ago

The predefined summary operations should be able to work with arbitrarily large data in a small amount of memory, but if you are passing extracted data back to R before summarizing it you can run out of memory. What is the call to exact_extract?
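
For reference, the memory behavior of the two calling styles differs; a rough sketch, with rast and polys standing in for your raster and watershed polygons:

library(exactextractr)
library(terra)
library(sf)

rast  <- terra::rast("large_raster.tif")   # placeholder paths
polys <- sf::st_read("watersheds.gpkg")

# Predefined summary operation: values are aggregated in C++ as the raster
# is read, so memory use stays small regardless of raster size
means <- exact_extract(rast, polys, "mean")

# No summary operation: all pixel values and coverage fractions for every
# polygon come back to R as data frames, which can exhaust memory
raw <- exact_extract(rast, polys)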

p-schaefer commented 1 year ago

Yeah, I am returning the raw data. And I was mistaken, terra::extract() actually crashes as well. The calculations I need to do on the rasters are quite complex, and I can't do them with the predefined summary operations. I'll have to do it in smaller chunks, I suppose. Do you think there would be any way for exact_extract() to write the data straight to a CSV or SQLite file or something like that when it's too large to fit in memory? Thanks.

dbaston commented 1 year ago

If you can't use a predefined summary operation, the next best thing is to use a custom summary function (https://github.com/isciences/exactextractr#summary-functions). This should only require you to fit the pixels for a single polygon in memory.
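
A minimal sketch of that interface, reusing the rast and polys placeholders from above; the function is called once per polygon with that polygon's values and coverage fractions:

# Coverage-weighted mean computed in R, one polygon at a time
cov_mean <- exact_extract(rast, polys, function(values, coverage_fraction) {
  sum(values * coverage_fraction, na.rm = TRUE) /
    sum(coverage_fraction[!is.na(values)])
})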

p-schaefer commented 1 year ago

I am working with watersheds, and unfortunately some of them span nearly the entire extent of the raster. That means I would need to hold nearly the entire dataset in memory as a matrix and then perform the calculations, which will likely require additional memory.

I've tried just the single largest polygon by itself and it still causes the above error, whether I return the raw matrix or try to calculate the custom summary function. So unfortunately, that approach just won't work for some polygons.

dbaston commented 1 year ago

If there are polygons whose bounding box intersects ~4 billion pixels (if I'm doing the math right on 260 GB), then the easiest solution might be to split the largest polygons (lwgeom::st_subdivide ?), perform the calculation on each sub-polygon, and then merge the results of those calculations.
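
A rough sketch of that split-and-merge approach for a mean (other statistics need their own merge algebra); max_vertices is just an illustrative value:

library(sf)
library(lwgeom)
library(exactextractr)

# Subdivide the largest watershed into smaller polygons
big_ws <- polys[which.max(st_area(polys)), ]
pieces <- st_collection_extract(st_subdivide(big_ws, max_vertices = 256), "POLYGON")

# Coverage-weighted sums and counts can be merged exactly across the pieces
parts <- exact_extract(rast, pieces, c("sum", "count"))
merged_mean <- sum(parts$sum) / sum(parts$count)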

p-schaefer commented 1 year ago

Unfortunately, I don't think all the calculations can be split up into chunks. I'm trying to calculate some hydrologically weighted landscape attributes similar to hydroweight. Specifically for the weighted standard deviation attribute, I would need to have the raw landscape pixel values, the weighting pixel values and several intermediate products all available simultaneously.

I recognize this is probably a niche case, and I can chunk up the polygons and write the data to a CSV or SQLite file myself, but I assume it could be done faster directly through exact_extract().

dbaston commented 1 year ago

If it can't be processed in chunks, and the values for a single polygon do not fit in memory, what would you do with the values written to CSV? Wouldn't you have the same problem when you read the CSV into memory?

(I'm interested in supporting the niche cases, just trying to understand what can be done here.)

p-schaefer commented 1 year ago

Yes, you are correct, a CSV won't do it. It would have to be something like an SQLite file where the intermediate products can be calculated in chunks and stored on disk.

dbaston commented 1 year ago

Hmm, I'm still not getting it. But to strictly answer your question, I don't think exact_extract is going to be able to split the polygon and write the values to a CSV/SQLite file any faster than you splitting the polygon and calling exact_extract repeatedly. The majority of the runtime is usually raster I/O and R overhead. You can't do anything about the R overhead (unless you use the command-line version), but splitting/feeding the polygons in an order consistent with how your raster is written on disk should make a difference.
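
One way to approximate that ordering, assuming a typical top-down, row-major GeoTIFF layout (a sketch only; the actual block layout varies by file):

library(sf)

# Feed polygons roughly north-to-south so consecutive calls read nearby blocks
ymaxes <- sapply(st_geometry(polys), function(g) st_bbox(g)["ymax"])
polys_ordered <- polys[order(ymaxes, decreasing = TRUE), ]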

p-schaefer commented 1 year ago

Yes, I'm sure I'm not doing a great job of explaining my situation. But if you imagine calculating a weighted standard deviation of a raster (let's say slope), weighted according to an inverse distance weighting from a point (let's call it IDW), we would need to first calculate the product of the slope and IDW for each pixel, and use that to calculate a weighted mean. Then subtract each original slope pixel from the weighted mean and multiply it again by that pixel's IDW. The sum of those resulting pixels then gets divided by the sum of the IDW pixels multiplied by a term calculated from the number of non-zero weights.
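
In plain R terms, that calculation is roughly the following (a sketch over simple vectors, with slope and idw holding the pixel values for one watershed and M the number of non-zero weights):

weighted_sd <- function(slope, idw) {
  w_mean <- sum(slope * idw) / sum(idw)      # IDW-weighted mean
  num    <- sum(idw * (slope - w_mean)^2)    # weighted squared deviations
  M      <- sum(idw > 0)                     # number of non-zero weights
  den    <- sum(idw) * (M - 1) / M           # denominator with bias correction
  sqrt(num / den)
}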

Using the raw pixel values stored in a SQLite database, I have been able to perform these calculations with a SQL query in situations where the entire dataset doesn't fit into memory. I think it works because the product columns can be calculated in chunks and stored on disk, then summed, which can easily be done in memory.
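
A minimal sketch of that kind of chunk-friendly SQL aggregation with RSQLite (the table and column names are illustrative, not from an actual workflow):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "pixels.sqlite")
# Table 'pixels' is assumed to hold one row per pixel: slope, idw
w_mean <- dbGetQuery(con, "SELECT SUM(slope * idw) / SUM(idw) AS wm FROM pixels")$wm
sums <- dbGetQuery(con, sprintf(
  "SELECT SUM(idw * (slope - %f) * (slope - %f)) AS num,
          SUM(idw) AS sw,
          SUM(CASE WHEN idw > 0 THEN 1 ELSE 0 END) AS m
     FROM pixels", w_mean, w_mean))
w_sd <- sqrt(sums$num / (sums$sw * (sums$m - 1) / sums$m))
dbDisconnect(con)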

Anyway, like I said, it's probably a niche case which we can work around; I was just wondering if there was a way to write the exact_extract values straight to disk more quickly. Thanks again!

dbaston commented 1 year ago

exactextractr internally implements a weighted standard deviation using the West (1979) formula, because the stdev operation considers cell coverage fractions as weights. If this were exposed to allow you to specify a weighting raster (as is the case with weighted_mean), could your calculation be expressed as exact_extract(slope, watersheds, 'weighted_stdev', weights=idw)?

p-schaefer commented 1 year ago

Yes, that would probably be the preferable solution. Would this method work with arbitrarily large rasters?

Also, how would missing data be handled? I also have a set of rasters representing categorical landscape features (i.e., urban land cover), where a pixel value of 1 means present and NA means absent. There is no stdev calculation required for these features, but I do calculate a weighted mean. For these means, NAs should be treated as 0s, but for numeric rasters (i.e., slope), NAs should be ignored. Is this something that can be implemented in exact_extract, or would it be better to recode the NAs to 0 in the categorical rasters for the purpose of the weighted_mean calculation?

Thank you!

dbaston commented 1 year ago

Would this method work with arbitrarily large rasters?

Yes -- the standard deviation calculation only needs to keep track of various sums, so it doesn't need all the pixels in memory at once.
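
Sketched in R, the idea is a streaming update of running totals in the style of West (1979); x_vals and w_vals are placeholders for pixel values and weights arriving chunk by chunk (the package does this in C++ internally):

# Only W, mean and S are carried between updates, never the full pixel set
update_stats <- function(state, x, w) {
  W_new    <- state$W + w
  delta    <- x - state$mean
  mean_new <- state$mean + (w / W_new) * delta
  list(W = W_new, mean = mean_new, S = state$S + w * delta * (x - mean_new))
}

state <- list(W = 0, mean = 0, S = 0)
for (i in seq_along(x_vals)) state <- update_stats(state, x_vals[i], w_vals[i])
weighted_variance <- state$S / state$W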

or would it be better to recode the NAs to 0 in the categorical rasters for the purpose of weighted_mean calculation?

You should be able to do this on the fly using the default_value / default_weight arguments.
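
For example, something along these lines for the categorical layers (urban and idw are placeholder rasters):

# NA pixels in the 0/1 urban raster are treated as 0 on the fly, while NAs in
# numeric rasters like slope are still skipped if no default_value is given
urban_mean <- exact_extract(urban, watersheds, "weighted_mean",
                            weights = idw, default_value = 0)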

p-schaefer commented 1 year ago

That's great. Sorry, another question: considering I have multiple weighting rasters (up to 4), multiple input layers (currently around 100), and multiple summaries I'd like to calculate (both weighted and unweighted), what's the best approach for handling this? It would probably make sense to extract the data from each layer as few times as possible, but I don't know the best way to loop over the layers now. When I had access to the whole raw pixel dataset it made more sense, but I'm not sure anymore. Could exact_extract(slope, watersheds, 'weighted_stdev', weights=idw) handle an idw with multiple layers, applying each idw layer to each input layer? Can you supply multiple weighted and unweighted summary operations?

And a final question: assuming this can all be done in a single function call, would it be possible to run it in parallel by polygon? This is something that can be done in R, but I'm wondering about the memory usage. Would adjusting max_cells_in_memory be enough to guarantee I don't run out of memory? If so, do you have a rough rule of thumb you use to calculate the memory requirement for a single raster cell that could be used to calculate the amount of available memory for each process?

dbaston commented 1 year ago

You can give the weighted_stdev operation a try using remotes::install_git('https://gitlab.com/isciences/exactextractr', ref='weighted-stdev').

Could exact_extract(slope, watersheds, 'weighted_stdev', weights=idw) handle an idw with multiple layers, applying each idw layer to each input layer?

You can call it with 1 input layer and N weight layers, or N input layers with 1 weight layer. For other combinations you'd need to write a loop.
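
A sketch of such a loop over every value-layer / weight-layer pairing, with loi_rasts and idw_rasts as placeholder multi-layer SpatRasters:

results <- list()
for (w in names(idw_rasts)) {
  # N value layers against one weight layer per iteration
  results[[w]] <- exact_extract(loi_rasts, watersheds,
                                c("weighted_mean", "weighted_stdev"),
                                weights = idw_rasts[[w]])
}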

And a final question, assuming this can all be done in a single function call, would it be possible to run it in parallel by polygon.

Last time I looked into this I think there were issues sharing the underlying raster objects, but that was a couple years ago.

do you have a rough rule of thumb you use to calculate the memory requirement for a single raster cell that could be used to calculate the amount of available memory for each process?

8 bytes per input raster pixel + 4 bytes per pixel for the coverage fraction.

p-schaefer commented 1 year ago

Thanks @dbaston for this update. Unfortunately, when I try to run

exact_extract(loi_rasts, watersheds, c("weighted_mean","weighted_stdev"), weights=iDW_rasts)

it ends up using 100% of my system memory and crashing RStudio without any error messages. Here iDW_rasts is a single layer with dimensions 27470, 26864, 1 (nrow, ncol, nlyr), and loi_rasts has dimensions 27470, 26864, 59 (nrow, ncol, nlyr).

Setting max_cells_in_memory=3e+05 did resolve this, but I wonder how max_cells_in_memory is implemented. Is it per layer? If so, the default value should not have run out of memory. The system I am testing on has 32 GiB of memory, so using the default 3e+07 pixels and (8 + 4) bytes per pixel (times 2, one for x and one for weights) should only use up ~0.67 GiB. Even if 3e+07 pixels from each layer were read in at once (3e+07 × (8 + 4) × 56), it should still only be ~19 GiB.

And also, it's unfortunate that you can't use multilayer x and weight rasters:

You can call it with 1 input layer and N weight layers, or N input layers with 1 weight layer. For other combinations you'd need to write a loop.

Is there any chance this could be implemented? In situations like mine, with multiple x and weight rasters, it becomes very inefficient to re-extract the raster values multiple times for each set of inputs.

Thanks again!

p-schaefer commented 1 year ago

Also, I think there might be an issue when weights has NA values: the entire weighted summary returns NA, which I don't think is the intended behavior. Setting default_weight to 0 returns a value, but I suspect treating NAs as 0s would have implications for the calculations.

dbaston commented 1 year ago

I wonder how max_cells_in_memory is implemented? Is it per layer?

Yes, so I would expect the default value to still work in your case. I would have to really trace through with actual data to see why it isn't working.

Is there any chance this could be implemented?

There's no technical reason it couldn't; I was just trying to mimic R conventions for recycling. Taking a Cartesian product would conflict with this behavior in the case where we have N values and N weights, so it would have to be indicated with (yet another) argument to exact_extract.

Also, I think there might be an issue when weights has NA values, the entire weighted summary returns NA, which I don't think is the intended behavior.

It's not clear to me what the result of a weighted operation would be when some of the weights are undefined. Or are these cases where both value and weight are NA?

p-schaefer commented 1 year ago

Yes, so I would expect the default value to still work in your case. I would have to really trace through with actual data to see why it isn't working.

I'll see if I can get some of the data to you over the next few days. Might be a bit tough because it's so big.

There's no technical reason it couldn't; I was just trying to mimic R conventions for recycling. Taking a Cartesian product would conflict with this behavior in the case where we have N values and N weights, so it would have to be indicated with (yet another) argument to exact_extract.

Yes, this would stray from R's default behavior with matrices. I wonder if a formula interface might make it more easily customizable?

It's not clear to me what the result of a weighted operation would be when some of the weights are undefined. Or are these cases where both value and weight are NA?

In my mind, I would think that if default_weight and default_value are both NA_real_, then an NA pixel in either case would be ignored in the associated calculations, which seems to be the case for missing x pixels, but not missing weight pixels?

dbaston commented 1 year ago

seems to be the case for missing x pixels, but not missing weight pixels?

Yes, it works mostly like weighted.mean with na.rm = TRUE. It's OK if you don't know a value -- just exclude it from the computation. But if a value is known and its associated weight is unknown, you can't do a weighted computation.

> weighted.mean(c(1, NA, 3), c(1, 2, 3), na.rm=TRUE)
[1] 2.5
> weighted.mean(c(1, 1, 3), c(1, NA, 3), na.rm=TRUE)
[1] NA

dbaston commented 1 year ago

I've merged in the weighted_stdev operation, which I think resolves the core of this issue (all the data can now be processed in limited memory). Feel free to open issues for the other topics (interpretation of max_cells_in_memory, calculating stats for all combinations of value/weight rasters).