Closed SteveBronder closed 7 years ago
I guess this is the main question:
Within the same resampling instance (i.e. going from the first window to the second window) is it possible to pass a previous iteration of a model into a new model with the current resampling framework?
It's not directly clear to me how this would work in mlr, but I certainly think it is worth thinking about. I believe there are plenty of learners where this could be used to speed up resampling dramatically.
Doesn't easily work with the current framework, since resample builds upon parallelMap, which makes communication between resampling instances very hard (especially if the parallelization level is "resample", as you would expect). You are better off writing a new function that does this imho. It would probably collect training and test set performances in a simple loop and call mergeResampleResult.
I'd wonder whether some measures and measure aggregates assume that the training and test set sizes are equal across resampling instances, and so have a different interpretation when used in this setting.
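To make the aggregation concern concrete, here is a small base-R sketch with made-up numbers (the window sizes and squared errors are purely illustrative and do not come from mlr). When test windows differ in size, the mean of the per-window MSEs is not the same as the MSE pooled over all test observations:

```r
# Hypothetical squared errors for two test windows of unequal size
sq.err = list(
  window1 = c(1, 1),             # 2 test observations
  window2 = c(4, 4, 4, 4, 4, 4)  # 6 test observations
)

# Aggregate 1: mean of the per-window MSEs (each window weighted equally)
mean.of.means = mean(sapply(sq.err, mean))

# Aggregate 2: MSE pooled over all test observations
pooled = mean(unlist(sq.err))

print(c(mean.of.means = mean.of.means, pooled = pooled))
```

Which of the two is the "right" number depends on whether each window or each observation should carry equal weight, so an online resampling function would have to pick one deliberately.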
One possible 'easy' way to do this would be to have two separate resampling functions, resample.online and resample.offline, or something similar. resample.offline would be the normal one, but resample.online would be something like (pseudo-code):
resample.online = function(...) {
  # original train set and testing windows
  train_set, test_sets = makeResampleInstances(...)
  # do first window
  model = train(mod, train_set)
  pred.test = predict(model, test_sets[[1]])
  evals = list(performance(pred.test))
  # update the model with each test window, then evaluate on the next one
  for (i in seq_len(length(test_sets) - 1)) {
    # pass the most recent model so updates accumulate
    model = updateTrain(mod, test_sets[[i]], model)
    pred.test = predict(model, test_sets[[i + 1]])
    evals[[i + 1]] = performance(pred.test)
  }
  evals
}
Here each online learner would have an updateTrain function that is called. I want to look through our current learners and see how many models this would affect.
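The pseudo-code above can be turned into a runnable base-R sketch if we swap in a toy learner. Everything here is an assumption for illustration: trainMean, updateMean and resampleOnline are hypothetical helpers, not mlr functions, and the "model" is just a running mean, chosen because it can be updated exactly from new data without refitting.

```r
# Toy "learner": predicts the running mean of the target.
# trainMean and updateMean are hypothetical helpers, not mlr functions.
trainMean = function(y) list(n = length(y), mean = mean(y))

updateMean = function(model, y.new) {
  n = model$n + length(y.new)
  list(n = n, mean = (model$n * model$mean + sum(y.new)) / n)
}

resampleOnline = function(y, n.train, horizon) {
  # consecutive test windows of length `horizon` after the initial training set
  starts = seq(n.train + 1, length(y) - horizon + 1, by = horizon)
  test.sets = lapply(starts, function(s) s:(s + horizon - 1))
  model = trainMean(y[seq_len(n.train)])
  mse = numeric(length(test.sets))
  for (i in seq_along(test.sets)) {
    idx = test.sets[[i]]
    mse[i] = mean((y[idx] - model$mean)^2)  # evaluate on the current window
    model = updateMean(model, y[idx])       # then absorb it into the model
  }
  list(mse = mse, model = model)
}

set.seed(1)
y = cumsum(rnorm(30))
res = resampleOnline(y, n.train = 10, horizon = 5)
print(res$mse)
```

After the last window the updated model has seen all 30 observations, so its mean matches a full refit on the whole series, while each update only touched the five new points.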
it is definitely possible to "somehow" integrate this into mlr. but not without lots of work. i mean it is doable, but somebody would need to make this "his" project. if that person does not exist yet, this should be closed and mentioned and linked in the mlr extension wiki page IMHO.
List of possible learners that could use this:
(all the forecasting learners)
(current learners)
... (object to pass previous tree learners)
... (object to pass previous tree learners)
... (model argument allows previous rparts to be passed to new models)
These are a handful that I saw looking around which have uses in both classification and regression. So in total about 20 learners. A number of other learners allow for the user to specify the starting parameters, which could be used in a similar fashion with an online model.
@berndbischl is mlr doing the google summer of code again?
If so, I could be the guy to do this
is mlr doing the google summer of code again?
yes, this is the project: https://github.com/rstats-gsoc/gsoc2017/wiki/Operator-Based-Machine-Learning-Pipeline-Construction
the deadline for proposal topics has passed. are you interested in the above?
Just realized I can't because I graduated in December :-/ Though that is a very cool project
what r u doing now?
¯\(ツ)/¯
Took time in January to see family. Now applying to jobs, trying to finish up a few papers, nothing too serious.
@berndbischl here is one simple way to do this, though it would require using R's lexical scoping and that the user does not set the parallelization level to resample. Take for example the ets learner in forecast:
trainLearner.fcregr.ets = function(.learner, .task, .subset, .weights = NULL, ...) {
  data = getTaskData(.task, .subset, target.extra = TRUE)
  data$target = ts(data$target, start = 1, frequency = .task$task.desc$frequency)
  forecast::ets(y = data$target, ...)
}
We know that within each resampling iteration we have a few objects that are passed to each iteration, say for example we take rin. Within the training function, after the initial run, we can save the model by adding something like
trainLearner.fcregr.ets = function(.learner, .task, .subset, .weights = NULL, ...) {
  data = getTaskData(.task, .subset, target.extra = TRUE)
  data$target = ts(data$target, start = 1, frequency = .task$task.desc$frequency)
  pred.type = getLearnerPredictType(.learner)
  # Checking if online model and whether first iteration
  if (pred.type == "online") {
    if (!is.null(rin$update.mod)) {
      mod = forecast::ets(y = data$target, model = rin$update.mod, ...)
    } else {
      mod = forecast::ets(y = data$target, ...)
      # IMP: Update to add model to rin
      rin$update.mod <<- mod
    }
  } else {
    mod = forecast::ets(y = data$target, ...)
  }
  return(mod)
}
Is that how lexical scoping would work here? I'm assuming, done sequentially, R will first pass rin$update.mod to the next iteration. So on the next iteration it has the updated model.
Though this is not great. For instance, this could cause problems when doing parallelization at the resampling level.
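For the scoping question, a minimal mlr-free sketch may help. R resolves free variables in the environment where a function was defined, not the one it was called from, and `<<-` assigns into the enclosing environment in which the variable is found. In the sketch below, makeTrainer and state are hypothetical names, with state playing the role of rin; the "model" is a plain list rather than an ets fit.

```r
makeTrainer = function() {
  state = list(update.mod = NULL)  # plays the role of rin
  function(y) {
    first = is.null(state$update.mod)
    # stand-in for a model fit; warm.start records whether a previous
    # model was available when this one was built
    mod = list(mean = mean(y), warm.start = !first)
    # `<<-` modifies `state` in the enclosing environment, not a local copy
    state$update.mod <<- mod
    mod
  }
}

train = makeTrainer()
m1 = train(c(1, 2, 3))     # first call: no previous model available
m2 = train(c(1, 2, 3, 4))  # second call: sees the model saved by the first
print(c(m1$warm.start, m2$warm.start))
```

This only works because both calls run sequentially in the same R process. Worker processes in a parallel run each hold their own copy of the state, so a model saved in one worker is not visible to the others, which is the problem with parallelizing at the resampling level.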
How would an online learner fit into the mlr resampling schema?
In the forecasting branch we have growing window cross validation, as shown at the bottom of this graphic, where red is the training data and blue is the testing data.
In this schema, if a model can be updated with new data, then instead of retraining the entire model we simply shift the testing data forward and add the previous iteration's testing data to the model.
I think this would lead to dramatic speedups for many models, though it would probably take a whole new resampling schema.
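As a rough illustration of why updating is enough, the following base-R sketch builds growing-window splits and checks what changes between iterations. growingWindows is a hypothetical helper written for this example, not the forecasting branch's implementation, and the window sizes are arbitrary.

```r
# Growing-window splits: hypothetical helper, not part of mlr
growingWindows = function(n, n.initial, horizon) {
  starts = seq(n.initial + 1, n - horizon + 1, by = horizon)
  lapply(starts, function(s) {
    list(train = seq_len(s - 1), test = s:(s + horizon - 1))
  })
}

splits = growingWindows(n = 20, n.initial = 10, horizon = 5)

# observations added to the training set between iteration 1 and 2
new.train = setdiff(splits[[2]]$train, splits[[1]]$train)
print(new.train)
print(splits[[1]]$test)
```

The only data the second iteration's training set adds is the first iteration's test window, so an updatable model needs to see just those observations rather than being refit on everything.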
Does anyone have thoughts on how we could integrate this into mlr?