Prediction on multiband rasters #448

pat-s commented 3 years ago

AFAICS there is no way around having to "split" a multiband raster (or convert it to a data.frame) if one wants to predict on it - is that correct?

In ?predict.stars I read:

separate predictors in object need to be separate attributes in object; in case they are e.g. in a band dimension, use 'split(object)'

I found this part hard to understand when getting started with {stars}. It became clearer after looking at the modelling vignette, especially the section "stars objects as data.frame" (just as feedback).
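
For readers hitting the same wording, here is a minimal sketch of what "separate attributes" means in practice. It assumes the Landsat sample file that ships with {stars} (`tif/L7_ETMs.tif`); the object names are only illustrative.

```r
library(stars)

# Sample multiband raster bundled with {stars}: one attribute,
# with the bands stored along a third dimension called "band"
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
x <- read_stars(tif)
names(dim(x))   # dimension names, including "band"
length(x)       # number of attributes before splitting

# split() turns each band into its own attribute,
# which is the layout predict() on a stars object expects
x_spl <- split(x, "band")
names(x_spl)    # one name per band

# as.data.frame() then yields one column per predictor, plus the coordinates
df <- as.data.frame(x_spl)
str(df)
```

After the split, each former band can be referenced by its attribute name in a model formula.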

Both terra::predict() and raster::predict() seem to do this internally, i.e. if the predictors are stored in multiple bands, users can just pass the stack and they will be found/used.

Is it likely that this will stay like this in the future? Asking because I am currently about to add spatial prediction support in {mlr3spatial}. I was wondering if potentially {stars} could detect a multiband raster and try to do the split internally - the model should then complain anyhow if some variables are missing when doing the final predict call.

edzer commented 3 years ago

The terra data model is much simpler: it doesn't have attributes, and only copes with stacks of rasters - whatever the stack "dimension" represents.

I'd like to make this simpler, but suppose you have a time series of spectral imagery, and time and spectral are both dimensions (meaning you have a single attribute, four-dimensional data cube) - how could a predict know what the different predictors are: the time instances, the spectral bands, or all unique combinations of time x band?
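
The ambiguity can be sketched with a small synthetic cube. Everything here is made up for illustration (the array sizes, the dimension names `band` and `time`), and it assumes `st_as_stars()` picks up the named `dim` of a plain array:

```r
library(stars)

set.seed(1)
# Hypothetical single-attribute cube: 4 x 4 pixels, 3 bands, 2 time steps
arr <- array(runif(4 * 4 * 3 * 2), dim = c(x = 4, y = 4, band = 3, time = 2))
cube <- st_as_stars(arr)

# Reading 1: the bands are the predictors, time stays a dimension
by_band <- split(cube, "band")
length(by_band)   # one attribute per band

# Reading 2: the time steps are the predictors, band stays a dimension
by_time <- split(cube, "time")
length(by_time)   # one attribute per time step
```

Both results are valid stars objects, so nothing in the data itself tells predict() which reading the model was trained on.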

pat-s commented 3 years ago

Yes I see the issues.

I haven't yet dived into the multi-dim capabilities of {stars}, e.g. a variable with observations across multiple timestamps. For now, I was solely focusing on getting spatial-only prediction right. Maybe one option would be to let {stars} try some "smart" defaults, based on the dimension metadata it has, and report this during the call (something like "Detected XY, using Z as TZ" and so forth)?

Some potentially helpful ideas:

I also see the advantages of unfolding the stars object into a data.frame first and then do predict.data.frame() and handing the variable curation over to the user. In {mlr3spatial} the return would be a {mlr3::Prediction} object and optionally a stars object written to disk - in {mlr3spatial} we anyhow do the prediction on a data.table internally, also for {terra} and {raster} objects. The use case for me only came up as I am about to benchmark predict() calls for various spatial packages and stumbled over the multi-band predict issue mentioned here :)

edzer commented 3 years ago

I agree that the simple case: a 3 D, single attr cube, where the third needs to be split for predict, could be a good default (with a message, and after trying & failing to call predict on that single attribute), and will look into that.

edzer commented 3 years ago

This should do it; would be great if you could check!

pat-s commented 3 years ago


In my reprex it does not work yet:

stack_rasterbrick <- demo_stack_rasterbrick(size = 1, layers = 5)
stack_stars <- stars::st_as_stars(stack_rasterbrick)
backend_stars <- as_stars_backend(stack_stars, quiet = TRUE, response = "y.1", response_is_factor = TRUE)
task_stars <- as_task_classif(backend_stars, target = "y.1", positive = "1")

row_ids <- sample(1:task_stars$nrow, 500)
e1071svm <- e1071::svm(y.1 ~ ., task_stars$data(rows = row_ids))

predict(e1071svm, stack_stars)
#> Error in eval(predvars, data, env): object 'x_2' not found

# works
stack_stars_df <- as.data.frame(split(stack_stars, "band"))
head(predict(e1071svm, stack_stars_df))
#> 1 2 3 4 5 6 
#> 1 1 1 1 1 1 
#> Levels: 1 0

edzer commented 3 years ago

That's because predict.stars seems to be never called - how would stars know it needs to call split() on the object?

> predict(e1071svm, stack_stars)
Error in eval(predvars, data, env) : object 'x_2' not found
> traceback()
8: eval(predvars, data, env)
7: eval(predvars, data, env)
6: model.frame.default(object, data, xlev = xlev)
5: model.frame(object, data, xlev = xlev)
4: model.matrix.default(delete.response(terms(object)), as.data.frame(newdata))
3: model.matrix(delete.response(terms(object)), as.data.frame(newdata))
2: predict.svm(e1071svm, stack_stars)
1: predict(e1071svm, stack_stars)

pat-s commented 3 years ago

Ah well, that was a dumb ordering mistake by me - ofc the stars object needs to be first in line.
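
The ordering matters because S3 dispatch in R looks only at the class of the first argument. A base-R sketch with two toy classes (both hypothetical stand-ins, not the real svm or stars classes):

```r
# Toy objects standing in for a fitted model and a raster cube
model <- structure(list(), class = "toy_model")
cube  <- structure(list(), class = "toy_cube")

# One predict() method per class; each just reports which method ran
predict.toy_model <- function(object, newdata, ...) "predict.toy_model"
predict.toy_cube  <- function(object, model, ...) "predict.toy_cube"

res_model_first <- predict(model, cube)  # "predict.toy_model"
res_cube_first  <- predict(cube, model)  # "predict.toy_cube"
```

With the model first, the cube's method never runs, so it never gets the chance to split anything.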

Yet I then get the following:

predict(stack_stars, e1071svm)
prediction on the entire object failed; will try to split() bands over attributes
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class ‘"try-error"’ to a data.frame
1: stop(gettextf("cannot coerce class %s to a data.frame", sQuote(deparse(class(x))[1L])), 
2: as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
3: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
4: data.frame(prediction = pr)
5: predict.stars(stack_stars, e1071svm)
6: predict(stack_stars, e1071svm)

Could it be that predict(split(object), model, ..., drop_dimensions = drop_dimensions) should be predict(split(object, "band"), model, ..., drop_dimensions = drop_dimensions) to make it work?

edzer commented 3 years ago

Ah, yes, recursion always needs care. This should do it - pls try.
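
The try-then-split idea can be sketched as below. This is not the code in {stars}: `predict_with_split()` is a hypothetical helper, the `lm()` model is a toy, and the data is the sample file bundled with the package. The key point is to check for a failed first attempt before using its result, and to retry only once on the split object.

```r
library(stars)

# Hypothetical helper illustrating the fallback; not the stars implementation
predict_with_split <- function(object, model, dimension = "band", ...) {
  # First attempt: predict on the object as it is
  pr <- try(predict(model, as.data.frame(object), ...), silent = TRUE)
  if (inherits(pr, "try-error")) {
    if (!dimension %in% names(dim(object)))
      stop("prediction failed and there is no `", dimension, "' dimension to split")
    message("prediction failed; will try to split() dimension `",
            dimension, "' over attributes")
    # Second attempt on the split object; an error here propagates normally
    pr <- predict(model, as.data.frame(split(object, dimension)), ...)
  }
  pr
}

tif <- system.file("tif/L7_ETMs.tif", package = "stars")
x <- read_stars(tif)                       # bands in a dimension, not attributes
train <- as.data.frame(split(x, "band"))   # columns x, y and one per band
m <- lm(X1 ~ X2 + X3, data = train[1:500, ])

pr <- predict_with_split(x, m)             # first attempt fails, fallback succeeds
length(pr) == nrow(train)
```

Passing the dimension name explicitly to split(), as suggested above, keeps the retry from depending on which dimension happens to come last.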

pat-s commented 3 years ago

Looking good!

I am wondering in which case "prediction on the entire object" would succeed here. I have to admit that I am also having a bit of trouble with the terminology used in the message, especially "entire object". In this particular case I would assume that "the entire object" would make use of all attributes of the object and hence succeed - or would that mean here that only the first attribute would be used unless split() is applied? Maybe a refinement of the message using the terms "dimension", "attribute" and "columns" could help clarify from a user's perspective?

stack_rasterbrick <- demo_stack_rasterbrick(size = 1, layers = 5)
stack_stars <- stars::st_as_stars(stack_rasterbrick)
backend_stars <- as_stars_backend(stack_stars, quiet = TRUE, response = "y.1", response_is_factor = TRUE)
task_stars <- as_task_classif(backend_stars, target = "y.1", positive = "1")

row_ids <- sample(1:task_stars$nrow, 500)
e1071svm <- e1071::svm(y.1 ~ ., task_stars$data(rows = row_ids))

predict(stack_stars, e1071svm)
#> prediction on the entire object failed; will try to split() bands over attributes
#> stars object with 2 dimensions and 1 attribute
#> attribute(s):
#>  prediction 
#>  1:28126    
#>  0:21603    
#> dimension(s):
#>   from  to offset      delta refsys point values x/y
#> x    1 223      0  0.0044843     NA    NA   NULL [x]
#> y    1 223      1 -0.0044843     NA    NA   NULL [y]

edzer commented 3 years ago

This now gives the message

prediction on array(s) `x_1' failed; will try to split() dimension `band' over attributes

pat-s commented 3 years ago

Thanks, this is more descriptive for users I'd say :)

Happy with the current setup, hence closing here.