kadyb closed this issue 3 years ago.
I'll put the code here; maybe someone will find it interesting and use it.
# divide the raster into smaller blocks
# input arguments: the file path and the width/height of the blocks (in pixels)
# output: a matrix of block parameters suitable for read_stars(RasterIO = ...)
getBlocks = function(path, x_window, y_window) {
  x_window = as.integer(x_window)
  y_window = as.integer(y_window)
  img = stars::read_stars(path, proxy = TRUE)
  img_cols = dim(img)[["x"]]
  img_rows = dim(img)[["y"]]
  # block offsets along each dimension; blocks at the right/bottom
  # edge are clipped so that no block reads past the raster
  x_off = seq.int(1L, img_cols, by = x_window)
  y_off = seq.int(1L, img_rows, by = y_window)
  x_size = pmin(x_window, img_cols - x_off + 1L)
  y_size = pmin(y_window, img_rows - y_off + 1L)
  # all combinations of column and row offsets, one row per block
  grid = expand.grid(ix = seq_along(x_off), iy = seq_along(y_off))
  mat = cbind(x = x_off[grid$ix], y = y_off[grid$iy],
              nXSize = x_size[grid$ix], nYSize = y_size[grid$iy])
  rownames(mat) = seq_len(nrow(mat))
  return(mat)
}
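The tiling arithmetic itself does not depend on stars; here is a minimal base-R sketch of the same offset/size computation for a single dimension (the function name `block_sizes` is chosen here just for illustration):

```r
# offsets start at 1 and step by the window size;
# the last block is clipped to the raster edge
block_sizes = function(n, window) {
  offsets = seq.int(1L, n, by = window)
  sizes = pmin(window, n - offsets + 1L)
  cbind(offset = offsets, size = sizes)
}

block_sizes(120L, 50L)
# three blocks: offsets 1, 51, 101 with sizes 50, 50, 20
```

The sizes along a dimension always sum to `n`, so every pixel is covered exactly once.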
The code below reads and processes the file in blocks.
path = system.file("tif/olinda_dem_utm25s.tif", package = "stars")
blocks = getBlocks(path, 50, 50)
for (bl in seq_len(nrow(blocks))) {
  rasterio = list(nXOff = blocks[bl, "x"], nYOff = blocks[bl, "y"],
                  nXSize = blocks[bl, "nXSize"], nYSize = blocks[bl, "nYSize"])
  tile = stars::read_stars(path, RasterIO = rasterio)
  # process the block
  # ...
}
Thanks for contributing! What was the motivation for this?
In my case, I have a large raster that cannot be fully loaded into memory. Splitting it into smaller blocks uses less memory, so I can finally process all the data.
OK, and setting proxy = TRUE in read_stars() didn't work out?
Honestly, I didn't try it. I have over 50 GB of data in a stack of 10 layers. In my project I have to perform a data transformation and then unsupervised classification. With the raster package I ran into memory allocation problems, so I decided to try stars.
I may be unaware, but is there no way for stars to process data in blocks automatically?
Yes, see https://r-spatial.github.io/stars/articles/stars2.html . There's a predict method for stars_proxy objects, and write_stars lets you define a chunk size.
How did you train the model? On chunks, or on a random sample?
I trained the model on randomly selected chunks (blocks).
I tried the example below, but I don't have enough RAM. Loading only part of the data solves the problem, e.g. pull(raster[1, 1:10000, 5000:15000]). It seems to me that a way to automatically read part of the data would be very helpful when the whole raster does not fit in memory, something similar to the getBlocks() I proposed.
library("stars")
library("dplyr")
raster = read_stars("test.tif", proxy = TRUE) # 4 GB
raster_vals = pull(raster)
raster_vals = st_as_stars(raster_vals)
raster_vals = as.vector(raster_vals)
raster_vals = na.omit(raster_vals)
#> Error: cannot allocate vector of size 1.8 Gb
# then do the clustering
Would you mind sharing test.tif with me somehow?
Never mind, I can work with the sample data in the vignette mentioned above.
Just in case, here is a reproducible example:
library("raster")
library("stars")
library("dplyr")
# adjust 'ncol' and 'nrow' values according to RAM
ras = raster(ncol = 20000, nrow = 20000)
idx = sample(length(ras), 500000)
values(ras) = runif(ncell(ras))
values(ras)[idx] = NA
writeRaster(ras, "test.tif")
rm(ras, idx)
raster = read_stars("test.tif", proxy = TRUE)
raster_vals = pull(raster)
raster_vals = st_as_stars(raster_vals)
raster_vals = as.vector(raster_vals)
raster_vals = na.omit(raster_vals)
#> Error: cannot allocate vector of size 1.5 Gb
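One way around this allocation error is to collect the non-NA values block by block instead of materializing the whole raster at once, in the spirit of the getBlocks() loop above. A minimal base-R sketch of that pattern, using a plain in-memory matrix as a stand-in for the GeoTIFF (in practice each block would come from read_stars(path, RasterIO = ...)):

```r
# simulate a raster with some missing cells
set.seed(1)
ras = matrix(runif(200 * 200), nrow = 200)
ras[sample(length(ras), 1000)] = NA

# walk over 50x50 blocks, keeping only the non-NA values of each block
vals = list()
for (row0 in seq.int(1L, nrow(ras), 50L)) {
  for (col0 in seq.int(1L, ncol(ras), 50L)) {
    nr = min(50L, nrow(ras) - row0 + 1L)   # clip the block at the edge
    nc = min(50L, ncol(ras) - col0 + 1L)
    block = ras[row0:(row0 + nr - 1L), col0:(col0 + nc - 1L)]
    vals[[length(vals) + 1L]] = block[!is.na(block)]
  }
}
vals = unlist(vals, use.names = FALSE)
# peak memory is one block plus the accumulated non-NA values
```

This keeps only one block in memory at a time, at the cost of losing the original cell order (fine for clustering on values, as here).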
This fixes some small things.
So here is what I did. kmeans does not have a predict method, so I created one, wrapping clue::cl_predict.
library(clue)
library(tidyverse)
predict.kmeans = function(object, newdata, ...) {
  unclass(clue::cl_predict(object, newdata[, -c(1:2)], ...))
}
cl_n = function(obj, nclus, nsample, ...) {
  sam = st_sample(obj, nsample)
  k = kmeans(na.omit(as.data.frame(sam)[, -c(1:2)]), nclus)
  predict(obj, k, ...)
}
library(stars)
# small dataset
tif = system.file("tif/L7_ETMs.tif", package = "stars")
i = read_stars(tif, proxy = TRUE) %>%
  split()
ncl = 5
out = cl_n(i, ncl, 1000)
write_stars(out, "out.tif")
# img:
png("out.png")
plot(read_stars("out.tif", proxy=TRUE), col = sf.colors(ncl, categorical=TRUE))
dev.off()
# large dataset:
granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
i = read_stars(s2, proxy = TRUE, NA_value = 0) %>%
  split()
out = cl_n(i, ncl, 1000)
# this will do the FULL computations, takes a few minutes:
#write_stars(out, "out2.tif")
# img: does the computation only for pixels plotted:
png("out2.png")
plot(out, col = sf.colors(ncl, categorical=TRUE))
dev.off()
[First image: out.png, the clustered L7 tile. Second image: out2.png, the clustered Sentinel-2 scene.]
Coming back to what you tried: st_as_stars() reads a proxy object into memory, so the full amount of memory is required. In my example above, st_sample is used on the proxy object and (now) returns a much reduced stars object (with coarse resolution; this needs some work though, e.g. I don't think there is a random offset in the sampling). Only write_stars() and plot() trigger execution of the predict step, where plot benefits from knowing the number of pixels that need to be computed, so it can do this fast.
Thanks for your work! I think this issue is very important in the age of processing huge sets of satellite data. Your method can be a good alternative to block processing. I will test it in the future.
@edzer, could you explain how to do it in the case below? I would like to do some preprocessing as an extra step using recipes, but with no luck.
library("stars")
library("mclust")
library("recipes")
tif = system.file("tif/L7_ETMs.tif", package = "stars")
ras = split(read_stars(tif))
df = as.data.frame(ras)[1:20000, -(1:2)]
rec = recipe( ~ ., data = df)
rec = step_normalize(rec, all_numeric())
rec = step_pca(rec, all_numeric(), num_comp = 3)
transformator = prep(rec, training = df, retain = FALSE)
df = bake(transformator, df, composition = "data.frame")
mdl = Mclust(df, G = 5)
pred = function(x, transformator, mdl) {
  x = na.omit(as.data.frame(x))
  x = x[, -(1:2)]
  x = bake(transformator, x, composition = "data.frame")
  return(predict(x, mdl))
}
pred(ras, transformator, mdl)
#> Error in UseMethod("predict") :
#> no applicable method for 'predict' applied to an object of class "data.frame"
You probably want to call predict(mdl, x), as there is no predict method for data.frame (such a method would not make sense: how should it predict? S3 method dispatch uses the first argument).
Something like
pred = function(x, transformator, mdl) {
  df = na.omit(as.data.frame(x))
  df = df[, -(1:2)]
  df = bake(transformator, df, composition = "data.frame")
  x$cluster = predict(mdl, df)$classification
  x["cluster"]
}
plot(pred(ras, transformator, mdl), col = sf.colors(5, categorical = TRUE))
This obviously breaks if the raster has missing values.
Thanks!
This obviously breaks if the raster has missing values.
My raster has missing values, so I will have to solve this.
UPDATE: After considering Edzer's comment below, here is my way to deal with NAs in raster.
pred = function(x, transformator, mdl) {
  NA_idx = is.na(x)
  NA_idx = unlist(NA_idx, use.names = FALSE)
  dim(NA_idx) = c(prod(dim(x)), length(x))
  NA_idx = apply(NA_idx, 1, any)  # a cell is NA if any layer is NA there
  dim(NA_idx) = c(nrow(x), ncol(x))
  df = as.data.frame(x)[, -(1:2)]
  df = na.omit(df)
  df = bake(transformator, df, composition = "data.frame")
  x$cluster = rep(NA_integer_, prod(dim(x)))
  x$cluster[!NA_idx] = predict(mdl, df)$classification
  x["cluster"]
}
My raster has missing values, so I will have to solve this.
With na.omit you lose their positions. Some (good) predict methods deal with them properly; if not, you'll have to find their locations and "insert" them back later into the vector of predicted classes.
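The re-insertion described here can be sketched in a few lines of base R: record the NA positions first, predict on the complete cases only, then scatter the predictions back into a full-length vector (`cut` stands in for a real predict call):

```r
vals = c(1.2, NA, 3.4, 0.5, NA, 2.2)   # cell values, some missing
na_idx = is.na(vals)                    # remember where the NAs are
# stand-in for predict(): assign each complete case to one of 2 bins
pred_ok = as.integer(cut(vals[!na_idx], breaks = 2))
out = rep(NA_integer_, length(vals))    # full-length result vector
out[!na_idx] = pred_ok                  # insert predictions back in place
out
```

The predicted classes land back at their original cell positions, and the NA cells stay NA.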
https://github.com/r-spatial/stars/issues/362#issuecomment-874667243
This code works for proxy = FALSE, but should it work for proxy = TRUE? I noticed strange behavior if proxy = TRUE: probably stars creates ncol * nrow files(?). Should the code look different? Sorry for the inconvenience, I know these questions can be annoying.
library("stars")
library("mclust")
library("recipes")
tif = system.file("tif/L7_ETMs.tif", package = "stars")
ras = read_stars(tif, proxy = TRUE)
ras = split(ras)
df = as.data.frame(ras)[1:20000, -(1:2)]
rec = recipe( ~ ., data = df)
rec = step_normalize(rec, all_numeric())
rec = step_pca(rec, all_numeric(), num_comp = 3)
transformator = prep(rec, training = df, retain = FALSE)
df = bake(transformator, df, composition = "data.frame")
mdl = Mclust(df, G = 4)
pred = function(x, transformator, mdl) {
  df = na.omit(as.data.frame(x))
  df = df[, -(1:2)]
  df = bake(transformator, df, composition = "data.frame")
  x$cluster = predict(mdl, df)$classification
  x["cluster"]
}
test = pred(ras, transformator, mdl)
test
#> stars_proxy object with 1 attribute in 122848 file(s); showing the first 10 filenames
#> Error in basename(x) : a character vector argument expected
Edit: x["cluster"] is an integer vector of length ncol * nrow, not files as I previously thought.
For this to work, you'd have to use the predict method for the stars_proxy object, as you want it to be evaluated either at low resolution (e.g. for plot) or chunk-wise (e.g. for write); right now, you ask R to call predict on the entire raster. This vignette might be helpful; let me know if you get stuck.
Thanks for the explanation! I haven't solved it yet, but I'm still struggling. I wonder if predict.stars should remove the first two columns with geometry (x and y)? It seems to me that some predictive models (including mclust) check the number of columns and return an error if they don't match.
library("stars")
library("mclust")
tif = system.file("tif/L7_ETMs.tif", package = "stars")
r = split(read_stars(tif, proxy = TRUE))
# PCA
pc = prcomp(as.data.frame(r)[, -(1:2)]) # based on all data
out = predict(r, pc)
plot(out)
# GMM
gmm = Mclust(as.data.frame(r)[1:10000, -(1:2)], G = 4)
out = predict(r, gmm)
plot(out)
#> Error in predict.Mclust(model, obj_df, ...) :
#> newdata must match ncol of object data
That is of course absolutely stupid of mclust, but hard to change. This change allows you to do out = predict(r, gmm, drop_dimensions = TRUE) for cases like this.
@edzer, what do you think about adding a helper function (as I suggested in my second post here) to the package that returns the nXOff, nYOff, nXSize, nYSize parameters? Personally, I use this often and believe it can be helpful in deep learning workflows. I can prepare a PR if you are interested.
Great idea!
I wrote a function that splits the raster into smaller blocks. Would you be interested to see it and maybe include it in the stars package? It is designed to work with read_stars(RasterIO).