rspatial / terra

R package for spatial data handling https://rspatial.github.io/terra/reference/terra-package.html

segfault with large rasters #237

Closed nkruskamp closed 2 years ago

nkruskamp commented 3 years ago

Hi, I am running into an issue I'm not sure how to interpret or troubleshoot. I am processing some large raster files with hundreds of layers (e.g. 2100 x 4650 x 520). There is a step where I need to replace NA values with zeros, and during that operation I get a segfault that kills my script.

I think the code below is a reproducible example: it creates a random raster with the same dimensions as one of my datasets and then triggers the segfault. (There is probably an easier and faster way to generate similar fake data to reproduce the error; please advise if so!)

library(terra)

# a pretty big raster
n_col <- 2100
n_row <- 4650
n_lay <- 520

# create a random raster and save it
raster_path <- "C:/tmp/random_raster_1.tif"
r <- rast(nrows = n_row, ncols = n_col, nlyr = n_lay)
values(r) <- runif(n_col * n_row * n_lay)
writeRaster(r, raster_path)

# read random raster file, try to reclassify
print("reading raster")
r <- rast(raster_path)
print("reclassifying raster")
# here is where segfault occurs
r2 <- classify(r, matrix(c(NA, 0), ncol = 2, byrow = TRUE), right = NA)
print("done.")

Output from sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] terra_1.2-10

loaded via a namespace (and not attached):
[1] compiler_4.0.3   sp_1.4-5         Rcpp_1.0.6       raster_3.4-10    codetools_0.2-18 grid_4.0.3       lattice_0.20-44

Thanks!

nkruskamp commented 3 years ago

I've gone back through my entire data processing workflow and realized, with some help, that the issue is likely due to my use of the TILED creation option in GDAL for these very large rasters. At some point I added that option to my code, copied over from previous work that was easy to parallelize by TIFF blocks.

Unfortunately, the code above was something I threw together without thinking the issue through: I don't think terra uses those creation options by default when writing files, so that example probably would not trigger the error.

After commenting out the TILED=YES option and the block sizes, I no longer get the segfault with the original data.

co = [
    "BIGTIFF=IF_SAFER",
    # "BLOCKXSIZE=1024",
    # "BLOCKYSIZE=1024",
    # "TILED=YES",
    "COMPRESS=LZW",
    "NUM_THREADS=ALL_CPUS",
]

Someone more knowledgeable of GDAL and TIF formats might be able to explain why this is the case.

rhijmans commented 3 years ago

That is good to know about. I have not experimented with TILED=YES, so I am reopening this so that I do not forget to do that. By the way, the default in terra is to use COMPRESS=LZW and to use BIGTIFF if necessary, so you should not have to specify these options.
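
For testing, GDAL creation options can be passed to writeRaster() through its gdal argument, so a tiled file can be written directly from terra. A rough, untested sketch (small dimensions and a temporary file just to show the mechanics; scale them up to match the original report to actually stress it):

library(terra)

# untested sketch: write a small tiled, LZW-compressed file with terra,
# then run the same classify() call on it as in the example above
r <- rast(nrows = 100, ncols = 100, nlyr = 5)
values(r) <- runif(ncell(r) * nlyr(r))
f <- file.path(tempdir(), "tiled_test.tif")
writeRaster(r, f, overwrite = TRUE,
            gdal = c("COMPRESS=LZW", "TILED=YES",
                     "BLOCKXSIZE=64", "BLOCKYSIZE=64"))
r2 <- classify(rast(f), matrix(c(NA, 0), ncol = 2), right = NA)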

nkruskamp commented 3 years ago

The code in my original post was thrown together to make a minimal reproducible example before I realized the problem was related to tiled TIFFs, so apologies if that sent you down the wrong path.

I primarily code in Python, so I use rasterio (or GDAL in this case) for data processing. The ecological model I am using is in R with terra under the hood for raster handling, which is what brought me here. But it appears that none of the geospatial software I have tried can read these files. The issue applies both to VRTs that reference single-band rasters created with the tiled option and to multiband TIFFs that are tiled, so it is not unique to terra and more likely lies with GDAL and/or the TIFF format. Additionally, it is not an issue when the file is under a certain size, but I am not sure where the threshold is in terms of file size or number of pixels.

For reference, here is a snippet of Python code where individual raster bands are stacked in VRT and TIFF format using the creation options above. These are the files that fail if TILED=YES is used.

from pathlib import Path
from osgeo import gdal

# src_dir, dst_dir (pathlib.Path objects) and file_name are defined earlier in the script

# set GDAL creation options, NOT TILED
co = [
    "BIGTIFF=IF_SAFER",
    # "BLOCKXSIZE=1024",
    # "BLOCKYSIZE=1024",
    # "TILED=YES",
    "COMPRESS=LZW",
    "NUM_THREADS=ALL_CPUS",
]
# get list of single band raster files
# ~520 files total
files = src_dir.glob("*.tif")
# set name of output file
dst_file = dst_dir / f"{file_name}"
# create VRT, separate bands
# make paths strings, GDAL won't accept Path objects
my_ras = gdal.BuildVRT(
    str(dst_file.with_suffix(".vrt")),
    [str(x) for x in files],
    options=gdal.BuildVRTOptions(
        # xRes=scale,
        # yRes=scale,
        # resampleAlg="nearest",
        separate=True,
    ),
)
my_ras = None
# create multiband TIF from VRT using above creation options.
my_ras = gdal.Translate(
    str(dst_file.with_suffix(".tif")),
    str(dst_file.with_suffix(".vrt")),
    options=gdal.TranslateOptions(creationOptions=co),
)
my_ras = None
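
As a quick check on any one of these stacked files, reading it back with terra and forcing a full pass over the data should hit the same problem if the file is affected. A small sketch (the path below is just a placeholder for one of the files written above):

library(terra)

# placeholder path; point this at one of the stacked, tiled files
r <- rast("C:/tmp/stacked_tiled.tif")
# global() reads every block of every layer, so it should hit the same
# problem as classify() if the file is affected
m <- global(r, "mean", na.rm = TRUE)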