r-spatial / stars

Spatiotemporal Arrays, Raster and Vector Data Cubes
https://r-spatial.github.io/stars/
Apache License 2.0

Parallel predict in stars #605

Closed kadyb closed 1 year ago

kadyb commented 1 year ago

There have been a few questions and tweets recently about how parallel raster processing can be done in R, and I think it would be great to create such an example for {stars}. I didn't see any such materials, but if they exist, please link them. I think it would be useful for users to add such a section to the "Statistical modelling with stars objects" vignette. I will prepare an example.

edzer commented 1 year ago

Are you thinking of a parallel predict?

edzer commented 1 year ago

For instance, predict.ranger, for RF prediction, already sets the number of threads equal to the number of CPUs by default.
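A minimal sketch of relying on a model's built-in multi-threading (here {ranger}; the `train` and `newdata` data frames are assumptions standing in for whatever training and prediction data you have):

```r
library("ranger")

# hypothetical training data.frame `train` with a factor response `use`;
# num.threads defaults to the number of available CPUs, shown explicitly here
rf = ranger(use ~ ., data = train, num.threads = 4)

# prediction is also multi-threaded internally
p = predict(rf, data = newdata, num.threads = 4)$predictions
```

With such models the parallelism happens inside one `predict()` call, so no raster tiling is needed on the R side.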

kadyb commented 1 year ago

> For instance, predict.ranger, for RF prediction, already sets the number of threads equal to the number of CPUs by default.

Yeah, some models support multi-threaded prediction. I didn't benchmark whether it is better to use one large raster with multi-threaded prediction, or single-threaded prediction on many small rasters simultaneously. Nevertheless, the user should be aware that both options exist.

kadyb commented 1 year ago

Here I adapted the example from "Statistical modelling with stars objects". For simplicity, I unpacked the raster archive and used only the three 20 m bands (B8A, B11 and B12). I compared 1-threaded and 4-threaded processing using {randomForest}, and the speedup on my PC is more than 50%. Should we add this example to the vignette? And maybe something can be improved in the code below?

Code

```r
library("stars")
library("randomForest")

bands = c("B8A", "B11", "B12") # 20 m resolution
f = paste0("IMG_DATA/", "T32ULE_20180220T105051_", bands, ".jp2")
r = read_stars(f, proxy = FALSE)
# for (i in seq_along(bands)) r[[i]] = as.integer(r[[i]])
names(r) = bands

cl = read_sf(system.file("gpkg/s2.gpkg", package = "stars"))
cl = st_transform(cl, st_crs(r))
pts = st_sample(cl, 1000, "regular")
pts = st_as_sf(pts)
pts = st_intersection(pts, cl)

train = st_extract(r, pts)
train$use = as.factor(pts$use)
train = data.frame(train)
train$x = NULL

# model
rf = randomForest(use ~ ., train)

### single thread ###
system.time({
  pr = predict(r, rf)
})
#> 400.63

### multi thread (4) ###
library("foreach")
library("doParallel")
cl = makeCluster(4) # number of cores
registerDoParallel(cl)

r = read_stars(f, proxy = TRUE)
names(r) = bands
tiles = st_tile(nrow(r), ncol(r), 512, 512) # size of tile
if (!dir.exists("tiles")) dir.create("tiles")
packages = c("stars", "randomForest")

system.time({
  foreach(i = seq_len(nrow(tiles)), .packages = packages) %dopar% {
    tile = read_stars(r, proxy = FALSE, RasterIO = tiles[i, ], NA_value = 0)
    pr = predict(tile, rf)
    save_path = file.path("tiles", paste0("tile_", i, ".tif"))
    write_stars(pr, save_path, NA_value = 0, options = "COMPRESS=NONE",
                type = "Byte", chunk_size = dim(tile))
    return(0) # don't return pr object as default
  }
  tiles = list.files("tiles", pattern = ".tif$", full.names = TRUE)
  tmp = tempfile(fileext = ".vrt")
  sf::gdal_utils(util = "buildvrt", source = tiles, destination = tmp)
  sf::gdal_utils(util = "translate", source = tmp, destination = "predict.tif")
})
#> 166.61
stopCluster(cl)
```
edzer commented 1 year ago

Yes, that is one way. I was thinking about passing, for instance, cl to predict.stars, and chunking the data.frame into pieces that are predicted in parallel. We're reading the whole thing into memory anyway, so why write & read the chunks to & from disk?
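A minimal sketch of that idea with base {parallel}, assuming an in-memory data.frame `df` of pixel values and a fitted model `rf` as in the example above (the chunking here is hypothetical, not something predict.stars does today):

```r
library("parallel")

cl = makeCluster(4)
clusterEvalQ(cl, library("randomForest"))
clusterExport(cl, "rf")  # ship the model to the workers

# split rows into as many contiguous chunks as there are workers
chunks = split(df, cut(seq_len(nrow(df)), 4, labels = FALSE))

# predict each chunk on a worker; chunks stay in original row order
pr = do.call(c, parLapply(cl, chunks, function(d) predict(rf, d)))

stopCluster(cl)
```

The appeal is that nothing touches disk: the chunks are serialized to the worker processes and the predictions come back as one vector.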

kadyb commented 1 year ago

I tested this on 8 GB RAM, so if we can load everything into memory, it will be even better. Multi-threaded prediction on a chunked data frame is a very good idea too.

edzer commented 1 year ago

Ah, it took me a while to understand: the %dopar% makes sure that no more than 4 tiles are being read & processed at the same time, right?
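A tiny illustration of that backpressure, assuming {doParallel} is installed: six tasks dispatched over two registered workers are served by only two processes, so at most two run concurrently:

```r
library("foreach")
library("doParallel")

registerDoParallel(2)  # two worker processes

# each task reports which worker ran it
pids = foreach(i = 1:6, .combine = c) %dopar% Sys.getpid()
length(unique(pids))   # at most 2 distinct worker PIDs

stopImplicitCluster()
```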

kadyb commented 1 year ago

That is right! And of course we can set the optimal number of cores and tile size.

kadyb commented 1 year ago

Here is another attempt, splitting the data frame into chunks. I'm not sure if the strategy of replacing NA with 0 and then predicting is optimal. Some algorithms are quite slow in prediction, and maybe it would be better to remove such rows from the data frame and then reconstruct the vector with NAs. But I didn't check whether it really matters.
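A sketch of that alternative (untested), assuming the same `df` and `rf` objects as in the code below: predict only on the complete rows, then reinsert NA at the missing positions instead of feeding zero-filled rows through the model.

```r
idx = complete.cases(df)

# predict only the complete rows
p = predict(rf, df[idx, , drop = FALSE])

# reconstruct a full-length factor, NA where inputs were missing
pr = factor(rep(NA, nrow(df)), levels = levels(p))
pr[idx] = p
```

Whether this wins depends on the NA fraction: the model sees fewer rows, at the cost of one extra subset and reassembly.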

Code

```r
library("foreach")
library("doParallel")
cl = makeCluster(4)
registerDoParallel(cl)

bands = c("B8A", "B11", "B12")
f = paste0("IMG_DATA/", "T32ULE_20180220T105051_", bands, ".jp2")
r = read_stars(f, proxy = FALSE)
names(r) = bands

df = data.frame(r)
crds = df[, 1:2]
df = df[, -c(1, 2)]

chunks = 50 # number of chunks
NA_found = anyNA(df)
if (NA_found) {
  idx = which(complete.cases(df))
  df[-idx, ] = 0
}

# make sure to use the `each` argument, otherwise the rows will be shuffled
df = split(df, rep(1:chunks, each = nrow(df) / chunks))

packages = c("stars", "randomForest")
system.time({
  pr = foreach(i = seq_along(df), .combine = c, .packages = packages) %dopar% {
    predict(rf, df[[i]])
  }
  if (NA_found) pr[-idx] = NA
  output = cbind(crds, pr)
  output = st_as_stars(output, st_crs = st_crs(r))
})
#> 167.85
stopCluster(cl)
```
edzer commented 1 year ago

I think that your first example, the one using tiling and `proxy = TRUE`, is worth showing in the ML vignette. Would you mind adding it?

kadyb commented 1 year ago

The proposal in #606.

edzer commented 1 year ago

I routinely run R CMD check before I commit anything. That doesn't work for me if it takes too long, as it does with this PR. It would be great if these sections were only run on CI, so they end up on the pkgdown site.

kadyb commented 1 year ago

Wouldn't it be better to disable the execution of this part of the vignette entirely? It is likely to run quite slowly on CI too.
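One common pattern for gating heavy chunks (a sketch; it assumes the standard `CI` environment variable that services like GitHub Actions set to `"true"`) is to make chunk evaluation conditional in the vignette's setup chunk:

```r
# in the vignette's setup chunk: evaluate expensive chunks only on CI
run_heavy = identical(Sys.getenv("CI"), "true")
knitr::opts_chunk$set(eval = run_heavy)
```

Locally (and during R CMD check on a developer machine) the chunks are skipped, while the pkgdown build on CI renders them in full.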

kadyb commented 1 year ago

I recently rewrote this content as a separate tutorial. Maybe you would be interested in publishing it on the r-spatial blog instead of extending the vignette, which greatly increases the build time? If you have any comments on the tutorial, let me know, of course.

Rmd: https://github.com/kadyb/stars-parallel/blob/main/Tutorial.Rmd
HTML: https://kadyb.github.io/stars-parallel/Tutorial.html

kadyb commented 1 year ago

So finally we have the blog post published here: https://r-spatial.org/r/2023/09/21/stars-parallel.html 🎉

edzer commented 1 year ago

We should also make a reference from one of the vignettes to this blog post.