r-spatial / stars

Spatiotemporal Arrays, Raster and Vector Data Cubes
https://r-spatial.github.io/stars/
Apache License 2.0

Function to split raster into blocks #362

Closed · kadyb closed this 3 years ago

kadyb commented 4 years ago

I wrote a function that splits a raster into smaller blocks. Would you be interested in seeing it and maybe including it in the stars package? It is designed to work with read_stars(RasterIO).

(image: blocks)

kadyb commented 3 years ago

I'll put the code here; maybe someone will find it useful.

# divide a raster into smaller blocks
# arguments: the file path and the block width (x_window) / height (y_window)
# returns a matrix of RasterIO offsets and sizes, one row per block

getBlocks = function(path, x_window, y_window) {

  x_window = as.integer(x_window)
  y_window = as.integer(y_window)

  img = stars::read_stars(path, proxy = TRUE)
  img_cols = dim(img)[["x"]]
  img_rows = dim(img)[["y"]]

  x_vec = integer()
  y_vec = integer()
  nXSize_vec = integer()
  nYSize_vec = integer()

  # x runs over columns (the GDAL X direction), y over rows (Y)
  for (x in seq.int(1L, img_cols, x_window)) {

    # truncate the last block at the right edge of the raster
    if (x + x_window - 1L <= img_cols) {
      nXSize = x_window
    } else {
      nXSize = img_cols - x + 1L
    }

    for (y in seq.int(1L, img_rows, y_window)) {

      # truncate the last block at the bottom edge of the raster
      if (y + y_window - 1L <= img_rows) {
        nYSize = y_window
      } else {
        nYSize = img_rows - y + 1L
      }

      x_vec = c(x_vec, x)
      y_vec = c(y_vec, y)
      nXSize_vec = c(nXSize_vec, nXSize)
      nYSize_vec = c(nYSize_vec, nYSize)
    }

  }

  mat = matrix(c(x_vec, y_vec, nXSize_vec, nYSize_vec), ncol = 4)
  colnames(mat) = c("x", "y", "nXSize", "nYSize")
  rownames(mat) = seq_len(nrow(mat))

  return(mat)

}

The code below reads and processes the file in blocks.

path = system.file("tif/olinda_dem_utm25s.tif", package = "stars")
blocks = getBlocks(path, 50, 50)

for (bl in seq_len(nrow(blocks))) {

  rasterio = list(nXOff = blocks[bl, "x"], nYOff = blocks[bl, "y"],
                  nXSize = blocks[bl, "nXSize"], nYSize = blocks[bl, "nYSize"])
  tile = stars::read_stars(path, RasterIO = rasterio)

  # process blocks
  # ...

}
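
For instance, the per-block step could accumulate sums to compute a global mean without ever loading the full raster; a toy stand-in for real processing:

total = 0
n = 0
for (bl in seq_len(nrow(blocks))) {

  rasterio = list(nXOff = blocks[bl, "x"], nYOff = blocks[bl, "y"],
                  nXSize = blocks[bl, "nXSize"], nYSize = blocks[bl, "nYSize"])
  tile = stars::read_stars(path, RasterIO = rasterio)

  vals = as.vector(tile[[1]])
  total = total + sum(vals, na.rm = TRUE)
  n = n + sum(!is.na(vals))
}
global_mean = total / n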
edzer commented 3 years ago

Thanks for contributing! What was the motivation for this?

kadyb commented 3 years ago

In my case, I have a large raster that cannot be fully loaded into memory. Splitting it into smaller blocks uses less memory, so I can eventually process all the data.

edzer commented 3 years ago

OK, and setting proxy = TRUE in read_stars() didn't work out?

kadyb commented 3 years ago

Honestly, I didn't try it. I have over 50 GB of data in a stack of 10 layers. In my project I have to perform a data transformation and then unsupervised classification. With the raster package I ran into memory allocation problems, which is why I decided to try stars.

I may have missed it, but is there no way for stars to process data in blocks automatically?

edzer commented 3 years ago

Yes, see https://r-spatial.github.io/stars/articles/stars2.html. There's a predict method for stars_proxy objects, and write_stars() lets you define a chunk size.
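
As a minimal sketch of that route (using the demo DEM shipped with stars; the 100-row chunk height is arbitrary), operations on a proxy are only recorded, and write_stars() then evaluates them block by block:

library(stars)
tif = system.file("tif/olinda_dem_utm25s.tif", package = "stars")
p = read_stars(tif, proxy = TRUE)  # lazy: nothing is read yet
p2 = p + 100                       # lazy as well: the call is only recorded
# evaluation happens here, in blocks of full width and 100 rows each
write_stars(p2, "shifted.tif", chunk_size = c(dim(p)[["x"]], 100))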

How did you train the model? On chunks, or on a random sample?

kadyb commented 3 years ago

I trained the model on randomly selected chunks (blocks).

kadyb commented 3 years ago

I tried the example below, but I don't have enough RAM. Loading only part of the data avoids the problem, e.g. pull(raster[1, 1:10000, 5000:15000]). It seems to me that a way of automatically reading parts of the data would be very helpful when the whole raster does not fit in memory, something similar to the getBlocks() function I proposed.

library("stars")
library("dplyr")

raster = read_stars("test.tif", proxy = TRUE) # 4 GB
raster_vals = pull(raster)
raster_vals = st_as_stars(raster_vals)
raster_vals = as.vector(raster_vals)
raster_vals = na.omit(raster_vals)
#> Error: cannot allocate vector of size 1.8 Gb

# then do the clustering
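
If the NA-omitted values themselves do fit in RAM, one workaround is to collect them block by block with the getBlocks() helper from above instead of pulling the whole raster at once; a sketch (the 5000 x 5000 block size is arbitrary):

blocks = getBlocks("test.tif", 5000, 5000)
vals = vector("list", nrow(blocks))
for (bl in seq_len(nrow(blocks))) {
  rasterio = list(nXOff = blocks[bl, "x"], nYOff = blocks[bl, "y"],
                  nXSize = blocks[bl, "nXSize"], nYSize = blocks[bl, "nYSize"])
  tile = stars::read_stars("test.tif", RasterIO = rasterio)
  vals[[bl]] = na.omit(as.vector(tile[[1]]))
}
raster_vals = unlist(vals)  # feed this to the clustering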
edzer commented 3 years ago

Would you mind sharing test.tif with me somehow?

edzer commented 3 years ago

Never mind, I can work with the sample data in the vignette mentioned above.

kadyb commented 3 years ago

Just in case, here is a reproducible example:

library("raster")
library("stars")
library("dplyr")

# adjust 'ncol' and 'nrow' values according to available RAM
ras = raster(ncol = 20000, nrow = 20000)
idx = sample(length(ras), 500000)
values(ras) = runif(ncell(ras))
values(ras)[idx] = NA
writeRaster(ras, "test.tif")
rm(ras, idx)

raster = read_stars("test.tif", proxy = TRUE)
raster_vals = pull(raster)
raster_vals = st_as_stars(raster_vals)
raster_vals = as.vector(raster_vals)
raster_vals = na.omit(raster_vals)
#> Error: cannot allocate vector of size 1.5 Gb
edzer commented 3 years ago

This fixes some small things.

So here is what I did. kmeans does not have a predict method, so I created one, wrapping clue::cl_predict.

library(clue)
library(tidyverse)

# kmeans has no predict method, so provide one by wrapping clue::cl_predict;
# the first two columns of newdata hold the x/y coordinates, so drop them
predict.kmeans = function(object, newdata, ...) {
    unclass(clue::cl_predict(object, newdata[, -c(1:2)], ...))
}

# fit k-means on a sample of the object, then predict (lazily for a proxy)
cl_n = function(obj, nclus, nsample, ...) {
    sam = st_sample(obj, nsample)
    k = kmeans(na.omit(as.data.frame(sam)[, -c(1:2)]), nclus)
    predict(obj, k, ...)
}

library(stars)
# small dataset
tif = system.file("tif/L7_ETMs.tif", package = "stars")
i = read_stars(tif, proxy = TRUE) %>%
    split()
ncl = 5
out = cl_n(i, ncl, 1000)
write_stars(out, "out.tif")

# img:
png("out.png")
plot(read_stars("out.tif", proxy=TRUE), col = sf.colors(ncl, categorical=TRUE))
dev.off()

# large dataset:
granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
i = read_stars(s2, proxy = TRUE, NA_value = 0) %>%
    split()
out = cl_n(i, ncl, 1000)

# this will do the FULL computations, takes a few minutes:
#write_stars(out, "out2.tif") 

# img: does the computation only for pixels plotted:
png("out2.png")
plot(out, col = sf.colors(ncl, categorical=TRUE))
dev.off()

First image: out.png. Second image: out2.png.

edzer commented 3 years ago

Coming back to what you tried: st_as_stars() reads a proxy object into memory, so the full amount of memory is required. In my example above, st_sample is used on the proxy object and (now) returns a much reduced stars object (at coarse resolution; this needs some work though, I don't think there is a random offset in the sampling, for instance). Only write_stars() and plot() trigger execution of the predict step, and plot benefits from knowing the number of pixels that need to be computed, so it can do this fast.
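
To make the difference concrete, a minimal sketch (on the small L7_ETMs demo file, where memory is not actually an issue):

library(stars)
i = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"), proxy = TRUE)
s = st_sample(i, 1000)  # cheap: a coarse-resolution sample of the proxy
f = st_as_stars(i)      # this is the step that reads the full raster into memory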

kadyb commented 3 years ago

Thanks for your work! I think this issue is very important in the age of processing huge satellite datasets. Your method can be a good alternative to block processing. I will test it in the future.

kadyb commented 3 years ago

@edzer, could you explain how to do it in the case below? I would like to do some preprocessing as an extra step using recipes, but have had no luck.

library("stars")
library("mclust")
library("recipes")

tif = system.file("tif/L7_ETMs.tif", package = "stars")
ras = split(read_stars(tif))
df = as.data.frame(ras)[1:20000, -(1:2)]

rec = recipe( ~ ., data = df)
rec = step_normalize(rec, all_numeric())
rec = step_pca(rec, all_numeric(), num_comp = 3)
transformator = prep(rec, training = df, retain = FALSE)
df = bake(transformator, df, composition = "data.frame")

mdl = Mclust(df, G = 5)

pred = function(x, transformator, mdl) {
  x = na.omit(as.data.frame(x))
  x = x[, -(1:2)]
  x = bake(transformator, x, composition = "data.frame")
  return(predict(x, mdl))
}

pred(ras, transformator, mdl)
#> Error in UseMethod("predict") :
#>   no applicable method for 'predict' applied to an object of class "data.frame"
edzer commented 3 years ago

You probably want to call predict(mdl, x): there is no predict method for data.frame (such a method would not make sense; how would it know what to predict?), and S3 method dispatch uses the class of the first argument.
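
In short, R picks the method from the class of the first argument:

predict(mdl, df)   # dispatches on class(mdl), "Mclust": predict.Mclust is found
# predict(df, mdl) # would dispatch on class(df), "data.frame": no method, error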

kadyb commented 3 years ago

Yes, predict(mdl, x) works. It returns a list with two components, classification and z, but shouldn't I get a stars object, as in the PCA and k-means examples? As the result of the prediction I would like to get a raster. What step am I missing?

edzer commented 3 years ago

Something like

pred = function(x, transformator, mdl) {
  df = na.omit(as.data.frame(x))
  df = df[, -(1:2)]
  df = bake(transformator, df, composition = "data.frame")
  x$cluster = predict(mdl, df)$classification
  x["cluster"]
}
plot(pred(ras, transformator, mdl), col = sf.colors(5, categorical = TRUE))


This obviously breaks if the raster has missing values.

kadyb commented 3 years ago

Thanks!

This obviously breaks if the raster has missing values.

My raster has missing values, so I will have to solve this.

UPDATE: After considering Edzer's comment below, here is my way of dealing with NAs in the raster.

pred = function(x, transformator, mdl) {
  # mark cells that are NA in any of the layers (attributes)
  NA_idx = is.na(x)
  NA_idx = unlist(NA_idx, use.names = FALSE)
  dim(NA_idx) = c(prod(dim(x)), length(x))
  NA_idx = apply(NA_idx, 1, any)
  dim(NA_idx) = c(nrow(x), ncol(x))

  df = as.data.frame(x)[, -(1:2)]
  df = na.omit(df)
  df = bake(transformator, df, composition = "data.frame")
  # fill complete cells with predictions, keep NA elsewhere
  x$cluster = rep(NA_integer_, prod(dim(x)))
  x$cluster[!NA_idx] = predict(mdl, df)$classification
  x["cluster"]
}
edzer commented 3 years ago

My raster has missing values, so I will have to solve this.

With na.omit you lose their positions. Some (good) predict methods deal with them properly; if not, you'll have to find their locations and "insert" them back later into the vector of predicted classes.
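
As a toy illustration of that pattern on a plain vector:

v = c(1.5, NA, 2.0, NA, 3.5)
ok = !is.na(v)                     # remember which positions were complete
cls = c(1L, 2L, 1L)                # stand-in for predictions on the non-NA values
out = rep(NA_integer_, length(v))
out[ok] = cls                      # reinsert the predictions at their original positions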

kadyb commented 3 years ago

https://github.com/r-spatial/stars/issues/362#issuecomment-874667243

This code works for proxy = FALSE, but should it also work for proxy = TRUE? I noticed strange behavior with proxy = TRUE: it looks as if stars creates ncol * nrow files(?). Should the code look different? Sorry for the inconvenience, I know these questions can be annoying.

library("stars")
library("mclust")
library("recipes")

tif = system.file("tif/L7_ETMs.tif", package = "stars")
ras = read_stars(tif, proxy = TRUE)
ras = split(ras)
df = as.data.frame(ras)[1:20000, -(1:2)]

rec = recipe( ~ ., data = df)
rec = step_normalize(rec, all_numeric())
rec = step_pca(rec, all_numeric(), num_comp = 3)
transformator = prep(rec, training = df, retain = FALSE)
df = bake(transformator, df, composition = "data.frame")

mdl = Mclust(df, G = 4)

pred = function(x, transformator, mdl) {
  df = na.omit(as.data.frame(x))
  df = df[, -(1:2)]
  df = bake(transformator, df, composition = "data.frame")
  x$cluster = predict(mdl, df)$classification
  x["cluster"]
}

test = pred(ras, transformator, mdl)
test
#> stars_proxy object with 1 attribute in 122848 file(s); showing the first 10 filenames
#> Error in basename(x) : a character vector argument expected

Edit: x["cluster"] is an integer vector of length ncol * nrow, not files as I previously thought.

edzer commented 3 years ago

For this to work, you'd have to use the predict method for the stars_proxy object, as you want it to be evaluated either at low resolution (e.g. for plot) or chunk-wise (e.g. for write); right now, you ask R to call predict on the entire raster. This vignette might be helpful; let me know if you get stuck.

kadyb commented 3 years ago

Thanks for the explanation! I haven't solved it yet, but I'm still trying. I wonder whether predict.stars should remove the first two columns with the geometry (x and y)? It seems that some predictive models (mclust included) check the number of columns and return an error if they don't match.

library("stars")
library("mclust")

tif = system.file("tif/L7_ETMs.tif", package = "stars")
r = split(read_stars(tif, proxy = TRUE))

# PCA
pc = prcomp(as.data.frame(r)[, -(1:2)]) # based on all data
out = predict(r, pc)
plot(out)

# GMM
gmm = Mclust(as.data.frame(r)[1:10000, -(1:2)], G = 4)
out = predict(r, gmm)
plot(out)
#> Error in predict.Mclust(model, obj_df, ...) : 
#>   newdata must match ncol of object data
edzer commented 3 years ago

That is of course absolutely stupid of mclust, but hard to change. This change allows you to call

out = predict(r, gmm, drop_dimensions = TRUE)

for cases like this.

kadyb commented 2 years ago

@edzer, what do you think about adding a helper function to the package (as I suggested in my second post here) that returns the nXOff, nYOff, nXSize, nYSize parameters? Personally, I use this often and believe it can be helpful in deep learning workflows. I can prepare a PR if you are interested.

edzer commented 2 years ago

Great idea!