ncss-tech / soilReports

An R package that assists with the setup and operation of a collection of soil data summary, comparison, and evaluation reports. These reports are primarily used by USDA-NRCS soil scientists in both initial and update mapping.

sampleRasterStackByMU - memory allocation issues on Windows 10? #95

Closed brownag closed 5 years ago

brownag commented 6 years ago

I have been attempting to perform constant-density sampling using a set of polygons spanning a fairly broad latitudinal range in MLRA 18 (all polygons where the Flanly series is a major component).

The rasters are successfully loaded into memory (where possible) and the following error message occurs during the sampling/extraction process. For instance:

test <- sampleRasterStackByMU(mu, mu.set, mu.col, raster.list, pts.per.acre, estimateEffectiveSampleSize = correct.sample.size)
Loading raster data...
Checking raster/MU extents...
Sampling polygons, and extracting raster values...
  |                                                                                                                |   0%
Error: cannot allocate vector of size 5.7 Gb


Upon removing the 30 m regional datasets (using only the 800 m PRISM data), the allocation error does not occur.

This does not appear to be an inherent problem with the code, since it runs fine with a smaller input dataset, though it may not scale well. However, this report previously ran fine (if a bit slow to sample) under Windows 7 with all of the larger 10 m or 30 m rasters included in the sampling stack.

I'll continue to try to trace this issue.
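For reference, the constant-density idea can be sketched in base R. The helper below is purely illustrative (it is not part of the soilReports API): the number of sample points drawn per polygon scales with polygon area, controlled by a pts.per.acre-style parameter.

```r
# Illustrative sketch (hypothetical helper, not the soilReports API):
# constant-density sampling draws a number of points proportional to
# polygon area, so large delineations get proportionally more points.
points_for_polygon <- function(area_m2, pts_per_acre = 1) {
  acres <- area_m2 / 4046.8564224  # square meters per acre
  max(1L, as.integer(round(acres * pts_per_acre)))  # at least 1 point
}

points_for_polygon(40468.56, pts_per_acre = 1)  # ~10 acre polygon -> 10 points
```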

brownag commented 6 years ago

Here are a few things that I have tried (unsuccessfully) to resolve this issue:

These were all hunches about the source of the problem. I should reiterate that the script worked without any of this sort of magic on Windows 7.

brownag commented 6 years ago
> traceback()
17: rgdal::getRasterData(con, offset = offs, region.dim = reg, band = object@data@band)
16: .readRasterLayerValues(x, 1, x@nrows)
15: .local(x, ...)
14: getValues(x)
13: getValues(x)
12: .readCells(x, cells, 1)
11: .cellValues(object, cells, layer = layer, nl = nl)
10: .xyValues(x, coordinates(y), ..., df = df)
9: .local(x, y, ...)
8: raster::extract(r, s)
7: raster::extract(r, s)
6: data.frame(value = raster::extract(r, s), pID = s$pID, sid = s$sid)
5: (function (r) 
   {
       res <- data.frame(value = raster::extract(r, s), pID = s$pID, 
           sid = s$sid)
       return(res)
   })(X, ...)
4: rapply(raster.list, how = "replace", f = function(r) {
       res <- data.frame(value = raster::extract(r, s), pID = s$pID, 
           sid = s$sid)
       return(res)
   })
3: sampleRasterStackByMU(mu, mu.set, mu.col, raster.list, pts.per.acre, 
       estimateEffectiveSampleSize = correct.sample.size)
2: withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
1: suppressWarnings(sampleRasterStackByMU(mu, mu.set, mu.col, raster.list, 
       pts.per.acre, estimateEffectiveSampleSize = correct.sample.size)) at #10
brownag commented 6 years ago
> memory.size()
[1] 12758.71
> memory.limit()
[1] 16185

And from help(memory.size):

Environment variable R_MAX_MEM_SIZE provides another way to specify the initial limit.

It appears all 16 GB of my RAM are available for use... as should be the default on a 64-bit installation.

brownag commented 6 years ago

Solution: raster::extract() is erroneously concluding (via canProcessInMemory()) that these large operations can be done fully in memory, causing an allocation limit to be hit before the theoretical maximum (i.e. the amount of unallocated RAM).

Here are the default raster options in current CRAN version of raster:

> rasterOptions()
format        : raster 
datatype      : FLT4S 
overwrite     : FALSE 
progress      : none 
timer         : FALSE 
chunksize     : 1e+08 
maxmemory     : 1e+10 
estimatemem   : FALSE 
tmpdir        : C:\Users\ANDREW~1.BRO\AppData\Local\Temp\Rtmp0weas5/raster/ 
tmptime       : 168 
setfileext    : TRUE 
tolerance     : 0.1 
standardnames : TRUE 
warn depracat.: TRUE 
header        : none

Setting maxmemory to 1E+09 resolves the issue (by setting the upper limit for an in-memory operation to 1 GB as opposed to 10 GB):

rasterOptions(maxmemory=1E+09)

It seems that even when quite a bit of RAM is available (in the realm of 10GB or more) Windows is unable to allocate anything over ~6GB on my machine. Similar tests on Dylan's machine broke at just over 7.5GB. Setting the max memory to 1GB forces the larger operations to be done out of memory.

See this open pull request on the rspatial/raster page that proposes changes that would resolve this issue. https://github.com/rspatial/raster/pull/11

I think all reports that rely on constant-density sampling such as this should use a heuristic to estimate the best chunk size and max memory, and set them as needed. It appears that the intention is for future CRAN versions of raster to include some sort of fix for this.
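A heuristic along those lines could be as simple as budgeting a conservative fraction of physical RAM, clamped to sane bounds. This is a base-R sketch; the fraction, floor, and cap values are assumptions for illustration, not tested recommendations.

```r
# Hypothetical heuristic: derive a maxmemory budget from total RAM,
# clamped to [floor_bytes, cap_bytes], instead of raster's 1e10 default.
suggest_maxmemory <- function(total_ram_bytes, fraction = 0.25,
                              floor_bytes = 1e9, cap_bytes = 4e9) {
  min(max(total_ram_bytes * fraction, floor_bytes), cap_bytes)
}

suggest_maxmemory(16 * 1024^3)  # 16 GB machine -> hits the 4 GB cap
# then, e.g.: raster::rasterOptions(maxmemory = suggest_maxmemory(16 * 1024^3))
```

The cap matters because, as noted above, Windows failed to allocate anything over ~6 GB here even with ~10 GB free.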

dylanbeaudette commented 6 years ago

All MU summary/comparison reports are going to break with raster 2.7-15 when using regional 30m data. Some options:

The fact remains that sampling CONUS 30m rasters is no longer possible with the current version of raster and our Windows group security policy.

rhijmans commented 6 years ago

Would you be able to test if this problem goes away with the development version of raster? Available from R-Forge or github: https://r-forge.r-project.org/R/?group_id=294 https://github.com/rspatial/raster

dylanbeaudette commented 6 years ago

Hi Robert, thanks for the suggestion. Is there a binary we can use? Unfortunately we don't have access to RTools or a suitable compiler on our machines.

And of course, thank you for the continued development of the raster package. USDA-NRCS staff use it daily.

rhijmans commented 6 years ago

Hi Dylan, You can install from here:

install.packages("raster", repos="http://R-Forge.R-project.org")

but that only works on current R (3.5.1) Robert


brownag commented 6 years ago

I reopened this issue.

I should not have touted the maxmemory 'fix' as a fix. Cutting it down that low really cripples some of the bigger operations to the point where they may never finish.

A larger extent (relative to the one where I found this issue) that I was working with this morning ran for approximately 40 minutes with no end in sight. It never appeared to get bogged down, but I think the benefit of the lowered memory threshold cannot outweigh the cost of the increased disk reads/writes (which appear to be imposed by our Windows 10 configuration).

dylanbeaudette commented 6 years ago

Thanks. As you say, we are stuck with a slightly older binary from r-forge (raster_2.7-8).

rasterOptions() reports:

...
chunksize     : 1e+07 
maxmemory     : 1e+09 
...

Testing with this version reveals that extract(r, s) with 1.2Gb raster and ~ 800k sample points takes longer than 2.5 hours. I am pretty sure that we were using this vintage of raster last time I did this analysis and it only took 12 minutes.

I suspect that this specific problem is related to the USDA policy of running 2 real-time scanning tasks, which are swamping all disk access.

rhijmans commented 6 years ago

I am really surprised that changing the maxmemory setting did not work out. Here is a windows binary package of the current version. It would be great if you could try it: https://drive.google.com/drive/folders/1REkkVqwGrCdV3iHzQkCJETslsLXJjMVn?usp=sharing

dylanbeaudette commented 6 years ago

Thanks Robert. Installed and got this err:

Error: package or namespace load failed for ‘raster’ in inDL(x, as.logical(local), as.logical(now), ...):
 unable to load shared object 'C:/Users/Dylan.Beaudette/Documents/R/win-library/3.4/raster/libs/x64/raster.dll'
rhijmans commented 6 years ago

Weird, how did you install?

dylanbeaudette commented 6 years ago
install.packages('E:/temp/raster_2.8-3.zip', repos = NULL)

It installed without error, but throws an error when loading it with library()

Note that we are "stuck" at R 3.4.0.

dylanbeaudette commented 6 years ago

Here are the details for the raster in question

class       : RasterLayer 
dimensions  : 97293, 154195, 15002094135  (nrow, ncol, ncell)
resolution  : 30, 30  (x, y)
extent      : -2361803, 2264047, 258854.3, 3177644  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0 
data source : E:\gis_data\CONUS\CONUS-forms-DEB.tif 
names       : CONUS.forms.DEB 
values      : 0, 255  (min, max)
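A back-of-envelope calculation (base R) using the dimensions printed above shows why a fully in-memory read of this grid is hopeless: raster holds cell values as doubles (8 bytes each) once loaded, regardless of the on-disk datatype.

```r
# ~15 billion cells (nrow * ncol from the RasterLayer print above),
# at 8 bytes per cell once loaded as doubles in R
n_cells <- 97293 * 154195      # 15002094135, matching ncell above
bytes   <- n_cells * 8
round(bytes / 1024^3)          # ~112 GiB needed for an in-memory read
```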
rhijmans commented 6 years ago

The problem was with the new release of raster. I tested with the code below and gave up after > 30 mins.

library(raster)

# r <- raster(nrow=97293, ncol=154195, ext=extent(-2361803, 2264047, 258854.3, 3177644), crs="+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m") 
# x <- init(r, fun='cell', filename="c:/temp/big.tif")

x <- raster("c:/temp/big.tif")
xy <- sampleRegular(raster(x), 800000, xy=TRUE)
v <- extract(x, xy)

It appears that this was caused by a change in raster via this pull request. I did not profile it; but rather than speeding things up, it seems to have had a dramatic opposite effect, at least in some situations. I have reverted to the previous code, and now I get this:

system.time(v <- extract(x, xy))
   user  system elapsed 
  23.70   11.97   35.92 

I get the same speed with the previous CRAN release.

dylanbeaudette commented 6 years ago

Thanks for testing Robert, this is a huge help! I tested on my Linux machine (R 3.4.1, raster 2.6-7) and it took about 19 minutes to complete. Your machine must be a lot faster than mine. Which version should we be on the lookout for?

rhijmans commented 6 years ago

I have submitted raster 2.8-4 to CRAN. Hopefully it will be available sometime next week. Thanks for your help and patience.

dylanbeaudette commented 6 years ago

Thanks Robert for all of the help and testing. Looking forward to the new release.

rhijmans commented 6 years ago

The new version is on CRAN now.

dylanbeaudette commented 6 years ago

Crud:

install.packages('c:/Temp/raster_2.8-4.zip', repos = NULL)

library(raster)
Error: package ‘raster’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version
In addition: Warning message:
package ‘raster’ was built under R version 3.5.1 

We are stuck with R 3.4.0. I wonder if CRAN will build the latest raster for r-oldrel?

dylanbeaudette commented 5 years ago

@rhijmans would you be willing to make us a custom raster_2.8 for R-oldrelease, via win-builder?

https://win-builder.r-project.org/

That would help us considerably while we wait for IT to get the current version of R.

dylanbeaudette commented 5 years ago

@brownag have we solved this issue? I no longer run into memory problems.

However, the raster sampling process takes 10x longer than it used to, probably related to the 2 real-time scanning processes that are always running. I can watch this via Task Manager.

Do we still need the following:

raster::rasterOptions(maxmemory=1E+09)

?

brownag commented 5 years ago

Adjusting rasterOptions() is not needed anymore. I have not run into memory problems, nor have I noticed any major differences in sampling... but I have not systematically tested the sampling speeds either.