pat-s closed this issue 3 years ago
The terra data model is much simpler: it doesn't have attributes, and only copes with stacks of rasters, whatever the stack "dimension" represents.
I'd like to make this simpler, but suppose you have a time series of spectral imagery, and time and spectral band are both dimensions (meaning you have a single-attribute, four-dimensional data cube). How could predict() know what the different predictors are: the time instances, the spectral bands, or all unique combinations of time x band?
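To make the ambiguity concrete, here is a minimal base-R sketch that does not use {stars} at all. It assumes a toy cube with dimensions x, y, band and time; the helper as_predictors() and the dimension sizes are hypothetical and only illustrate how the choice of predictor dimension(s) changes the table a model would see.

```r
# Toy 4-D cube: 3 x 3 pixels, 2 spectral bands, 4 time steps (sizes are arbitrary).
dims <- c(x = 3, y = 3, band = 2, time = 4)
cube <- array(seq_len(prod(dims)), dim = unname(dims))

# Hypothetical helper: the chosen dimensions become predictor columns,
# all remaining dimensions become rows (observations).
as_predictors <- function(a, dims, predictor_dims) {
  row_dims <- setdiff(names(dims), predictor_dims)
  # Move row dimensions first and predictor dimensions last, then flatten.
  perm <- match(c(row_dims, predictor_dims), names(dims))
  flat <- aperm(a, perm)
  as.data.frame(matrix(flat, ncol = prod(dims[predictor_dims])))
}

dim(as_predictors(cube, dims, "band"))             # 36 rows, 2 predictors
dim(as_predictors(cube, dims, "time"))             # 18 rows, 4 predictors
dim(as_predictors(cube, dims, c("band", "time")))  #  9 rows, 8 predictors
```

All three tables are valid readings of the same cube, so a generic predict method would need either a convention or a hint from the user to pick one.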
Yes I see the issues.
I haven't yet dived into the multi-dimensional capabilities of {stars}, e.g. a variable with observations across multiple timestamps. For now, I was solely focusing on getting spatial-only prediction right. Maybe one option would be to let {stars} try some "smart" defaults based on the dimension metadata it has, and report this during the call (something like "Detected XY, using Z as TZ" and so forth)?
Some potentially helpful ideas:
stars::predict() could just focus on the spatial dimensions, try to "split" the remaining ones internally, and then do the prediction.
I also see the advantages of unfolding the stars object into a data.frame first, then calling predict.data.frame() and handing the variable curation over to the user.
In {mlr3spatial} the return would be a {mlr3::Prediction} object and optionally a stars object written to disk; in {mlr3spatial} we do the prediction on a data.table internally anyway, also for {terra} and {raster} objects. The use case only came up for me because I am about to benchmark predict() calls for various spatial packages and stumbled over the multi-band predict issue mentioned here :)
I agree that the simple case, a 3-D, single-attribute cube where the third dimension needs to be split for predict, could be a good default (with a message, and only after trying and failing to call predict on that single attribute), and will look into that.
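The try-then-split default described above can be sketched in base R. This is an illustration under stated assumptions, not the {stars} implementation: the cube is a plain 3-D array, the predictor names b1 to b3 and the helper predict_cube() are hypothetical, and lm() stands in for an arbitrary model.

```r
# Toy single-attribute cube: 10 x 10 pixels, 3 bands in the third dimension.
set.seed(1)
cube <- array(runif(10 * 10 * 3), dim = c(10, 10, 3))
band_names <- c("b1", "b2", "b3")  # hypothetical predictor names

# Fit a model on a data.frame with one column per band.
train <- as.data.frame(matrix(cube, ncol = 3, dimnames = list(NULL, band_names)))
train$y <- 2 * train$b1 - train$b3 + rnorm(nrow(train), sd = 0.1)
model <- lm(y ~ b1 + b2 + b3, data = train)

# Hypothetical helper: try the whole array as a single column first; if that
# fails, spread the third dimension over columns and predict again.
predict_cube <- function(cube, model, band_names) {
  whole <- data.frame(values = as.vector(cube))
  pr <- try(predict(model, whole), silent = TRUE)
  if (inherits(pr, "try-error")) {
    message("prediction on the single attribute failed; ",
            "splitting the third dimension over columns")
    split_df <- as.data.frame(
      matrix(cube, ncol = dim(cube)[3], dimnames = list(NULL, band_names))
    )
    pr <- predict(model, split_df)  # a single retry, so no endless fallback
  }
  # Return the predictions on the spatial grid only.
  matrix(pr, nrow = dim(cube)[1], ncol = dim(cube)[2])
}

pred <- predict_cube(cube, model, band_names)
dim(pred)
```

The first attempt fails because the model cannot find b1, b2 and b3 in a one-column table; the fallback then supplies them and the result drops the band dimension.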
This should do it; would be great if you could check!
Thanks!
In my reprex it does not work yet:
library(mlr3spatial)
#> Loading required package: mlr3
# remotes::install_github("mlr-org/mlr3spatial")
library(mlr3)
library(stars)
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.9.1, GDAL 3.3.1, PROJ 8.1.0
stack_rasterbrick <- demo_stack_rasterbrick(size = 1, layers = 5)
stack_stars <- stars::st_as_stars(stack_rasterbrick)
backend_stars <- as_stars_backend(stack_stars, quiet = TRUE, response = "y.1", response_is_factor = TRUE)
task_stars <- as_task_classif(backend_stars, target = "y.1", positive = "1")
set.seed(42)
row_ids <- sample(1:task_stars$nrow, 500)
e1071svm <- e1071::svm(y.1 ~ ., task_stars$data(rows = row_ids))
predict(e1071svm, stack_stars)
#> Error in eval(predvars, data, env): object 'x_2' not found
# works
stack_stars_df <- as.data.frame(split(stack_stars, "band"))
head(predict(e1071svm, stack_stars_df))
#> 1 2 3 4 5 6
#> 1 1 1 1 1 1
#> Levels: 1 0
Created on 2021-08-31 by the reprex package (v2.0.1)
That's because predict.stars seems never to be called - how would stars know it needs to call split() on the object?
> predict(e1071svm, stack_stars)
Error in eval(predvars, data, env) : object 'x_2' not found
> traceback()
8: eval(predvars, data, env)
7: eval(predvars, data, env)
6: model.frame.default(object, data, xlev = xlev)
5: model.frame(object, data, xlev = xlev)
4: model.matrix.default(delete.response(terms(object)), as.data.frame(newdata))
3: model.matrix(delete.response(terms(object)), as.data.frame(newdata))
2: predict.svm(e1071svm, stack_stars)
1: predict(e1071svm, stack_stars)
Ah well, that was a dumb ordering mistake on my part - of course the stars object needs to be the first argument.
Yet I then get the following:
predict(stack_stars, e1071svm)
prediction on the entire object failed; will try to split() bands over attributes
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"try-error"’ to a data.frame
Backtrace:
1: stop(gettextf("cannot coerce class %s to a data.frame", sQuote(deparse(class(x))[1L])),
2: as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
3: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
4: data.frame(prediction = pr)
5: predict.stars(stack_stars, e1071svm)
6: predict(stack_stars, e1071svm)
Could it be that predict(split(object), model, ..., drop_dimensions = drop_dimensions) should be predict(split(object, "band"), model, ..., drop_dimensions = drop_dimensions) to make it work?
Ah, yes, recursion always needs care. This should do it - pls try.
Looking good!
I am wondering in which case "prediction on the entire object" would succeed here. I have to admit that I am also having a bit of trouble with the terminology used in the message, especially "entire object". In this particular case I would assume that "the entire object" would make use of all attributes of the object and hence succeed - or does it mean here that only the first attribute is used unless split() is applied?
Maybe refining the message using the terms "dimension", "attribute" and "columns" could help clarify things from a user's perspective?
library(mlr3spatial)
#> Loading required package: mlr3
# remotes::install_github("mlr-org/mlr3spatial")
library(mlr3)
library(stars)
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.9.1, GDAL 3.3.1, PROJ 8.1.0
stack_rasterbrick <- demo_stack_rasterbrick(size = 1, layers = 5)
stack_stars <- stars::st_as_stars(stack_rasterbrick)
backend_stars <- as_stars_backend(stack_stars, quiet = TRUE, response = "y.1", response_is_factor = TRUE)
task_stars <- as_task_classif(backend_stars, target = "y.1", positive = "1")
set.seed(42)
row_ids <- sample(1:task_stars$nrow, 500)
e1071svm <- e1071::svm(y.1 ~ ., task_stars$data(rows = row_ids))
predict(stack_stars, e1071svm)
#> prediction on the entire object failed; will try to split() bands over attributes
#> stars object with 2 dimensions and 1 attribute
#> attribute(s):
#> prediction
#> 1:28126
#> 0:21603
#> dimension(s):
#> from to offset delta refsys point values x/y
#> x 1 223 0 0.0044843 NA NA NULL [x]
#> y 1 223 1 -0.0044843 NA NA NULL [y]
Created on 2021-08-31 by the reprex package (v2.0.1)
This now gives the message
prediction on array(s) `x_1' failed; will try to split() dimension `band' over attributes
Thanks, this is more descriptive for users I'd say :)
Happy with the current setup, hence closing here.
AFAICS there is no way around "splitting" a multiband raster, i.e. converting it to a data.frame, if one wants to predict on it - is that correct?
In ?predict.stars I read:
I found this part hard to understand when getting started with {stars}. It became clearer after looking at the modelling vignette, especially the section "stars objects as data.frame" (just as feedback).
Both terra::predict() and raster::predict() seem to do this internally, i.e. if the predictors are stored in multiple bands, users can just pass the stack and they will be found and used.
Is it likely that this will stay like this in the future? I'm asking because I am about to add spatial prediction support to {mlr3spatial}. I was wondering whether {stars} could detect a multiband raster and try to do the split internally - the model should then complain anyway if some variables are missing in the final predict call.