mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Online learners #1572

Closed SteveBronder closed 7 years ago

SteveBronder commented 7 years ago

How would an online learner fit into the mlr resampling schema?

In the forecasting branch we have growing window cross validation, as illustrated at the bottom of this graphic, where red is the training data and blue is the testing data.

[image: growing window cross validation shown at the bottom; red = training data, blue = testing data]

In this schema, if a model can be updated with new data, then instead of retraining the entire model we simply shift the testing window forward and add the previous iteration's testing data to the model.
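To make the growing window idea concrete, here is a minimal base R sketch of the index sets. The helper `growing_windows` and its arguments `initial` and `horizon` are made up for illustration and are not part of mlr or the forecasting branch. It shows that each training set is just the previous training set plus the previous testing window, which is exactly the data an updatable model would need to absorb.

```r
# Hypothetical helper (not part of mlr): build growing window index sets.
# n       = number of observations
# initial = size of the first training window
# horizon = size of each testing window
growing_windows <- function(n, initial, horizon) {
  ends <- seq(initial, n - horizon, by = horizon)
  lapply(ends, function(end) {
    list(train = seq_len(end), test = (end + 1):(end + horizon))
  })
}

windows <- growing_windows(n = 10, initial = 4, horizon = 2)

# Check: every training set equals the previous training set
# followed by the previous testing window.
grows_by_test <- vapply(seq_along(windows)[-1], function(k) {
  identical(windows[[k]]$train,
            c(windows[[k - 1]]$train, windows[[k - 1]]$test))
}, logical(1))
```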

I think this would lead to dramatic speedups for many models, though it would probably take a whole new resampling schema.

Does anyone have thoughts on how we could integrate this into mlr?

SteveBronder commented 7 years ago

I guess this is the main question:

Within the same resampling instance (i.e. going from the first window to the second window) is it possible to pass a previous iteration of a model into a new model with the current resampling framework?

It's not immediately clear to me how this would work in mlr, but I certainly think it is worth thinking about. I believe there are plenty of learners where this could be used to speed up resampling dramatically.

mb706 commented 7 years ago

Doesn't easily work with the current framework, since resample builds upon parallelMap, which makes communication between resampling instances very hard (especially if the parallelization level is "resample", as you would expect). You are better off writing a new function that does this imho. It would probably collect training and test set performances in a simple loop and call mergeResampleResult.

I'd wonder whether some measures and measure aggregates make assumptions about the training and test set sizes being equal between resampling instances, and have a different interpretation when used in this setting.

SteveBronder commented 7 years ago

One possible 'easy' way to do this would be to have two separate resampling functions, resample.online and resample.offline or something similar. resample.offline would be the normal one, but resample.online would be something like (pseudo-code):

resample.online = function(...) {
  # original train set and list of testing windows
  inst = makeResampleInstances(...)
  train_set = inst$train_set
  test_sets = inst$test_sets
  # do first window
  model = train(mod, train_set)
  evals = vector("list", length(test_sets))
  evals[[1]] = performance(predict(model, test_sets[[1]]))
  # update the model with the window just evaluated, then predict the next one
  for (i in seq_len(length(test_sets) - 1)) {
    model = updateTrain(mod, test_sets[[i]], model)
    evals[[i + 1]] = performance(predict(model, test_sets[[i + 1]]))
  }
  evals
}

Here each online learner would have an updateTrain function that is called. I want to look through our current learners and see how many models this would affect.
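As a runnable illustration of the pseudo-code above, here is a small base R sketch with a toy "learner" that just predicts the running mean. Everything in it (`train_mean`, `update_mean`, `predict_mean`, `resample_online`, `resample_offline`) is a hypothetical stand-in, not mlr API. The point is only that updating the model with each evaluated window gives the same per-window performance as retraining from scratch, without touching the old data again.

```r
# Toy "learner": the model is the count and sum of the target seen so far.
train_mean <- function(y) list(n = length(y), total = sum(y))
# Hypothetical update step: fold new observations into the existing model.
update_mean <- function(model, y_new) {
  list(n = model$n + length(y_new), total = model$total + sum(y_new))
}
predict_mean <- function(model, n_ahead) rep(model$total / model$n, n_ahead)

# Online variant: train once, then update with each evaluated window.
resample_online <- function(y, initial, horizon) {
  ends <- seq(initial, length(y) - horizon, by = horizon)
  model <- train_mean(y[seq_len(initial)])
  mse <- numeric(length(ends))
  for (k in seq_along(ends)) {
    test_idx <- (ends[k] + 1):(ends[k] + horizon)
    mse[k] <- mean((y[test_idx] - predict_mean(model, horizon))^2)
    model <- update_mean(model, y[test_idx])
  }
  mse
}

# Offline variant: retrain on the full growing window every iteration.
resample_offline <- function(y, initial, horizon) {
  ends <- seq(initial, length(y) - horizon, by = horizon)
  vapply(ends, function(end) {
    model <- train_mean(y[seq_len(end)])
    test_idx <- (end + 1):(end + horizon)
    mean((y[test_idx] - predict_mean(model, horizon))^2)
  }, numeric(1))
}

y <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
online_mse <- resample_online(y, initial = 4, horizon = 2)
offline_mse <- resample_offline(y, initial = 4, horizon = 2)
```

For a real learner the update step would be model specific, and it would not necessarily reproduce the retrained model exactly the way this toy does.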

berndbischl commented 7 years ago

it is definitely possible to "somehow" integrate this into mlr. but not without lots of work. i mean it is doable, but somebody would need to make this "his" project. if that person does not exist yet, this should be closed and mentioned and linked in the mlr extension wiki page IMHO.

SteveBronder commented 7 years ago

List of possible learners that could use this:

(all the forecasting learners)

  1. ets
  2. garch
  3. tbats
  4. bats
  5. arima
  6. auto.arima

(current learners)

  1. ada (possibly? using update.ada with a new x)
  2. bst (argument object to pass previous tree learners)
  3. cforest (argument object to pass previous tree learners)
  4. earth/mars
  5. gbm
  6. xgboost
  7. rpart (? model argument allows previous rparts to be passed to new models)
  8. mboost (possibly? they have an update function, but I don't think this is what they intended)

These are a handful that I saw looking around which have uses in both classification and regression. So in total about 20 learners. A number of other learners allow for the user to specify the starting parameters, which could be used in a similar fashion with an online model.

SteveBronder commented 7 years ago

@berndbischl is mlr doing the google summer of code again?

If so, I could be the guy to do this

berndbischl commented 7 years ago

> is mlr doing the google summer of code again?

yes, this is the project: https://github.com/rstats-gsoc/gsoc2017/wiki/Operator-Based-Machine-Learning-Pipeline-Construction

the deadline for proposal topics has passed. are you interested in the above?

SteveBronder commented 7 years ago

Just realized I can't because I graduated in December :-/ Though that is a very cool project

berndbischl commented 7 years ago

what r u doing now?

SteveBronder commented 7 years ago

¯\\_(ツ)_/¯

Took time in January to see family. Now applying to jobs, trying to finish up a few papers, nothing too serious.

SteveBronder commented 7 years ago

@berndbischl here is one simple way to do this, but it would rely on R's lexical scoping and require that the user does not set the parallel level to resample. Take for example the ets learner in forecast:

trainLearner.fcregr.ets = function(.learner, .task, .subset, .weights = NULL, ...) {

  data = getTaskData(.task, .subset, target.extra = TRUE)
  data$target = ts(data$target, start = 1, frequency = .task$task.desc$frequency)
  forecast::ets(y = data$target, ...)
}

We know that a few objects are available in every resampling iteration; take rin, for example. Within the training function, after the initial run, we can save the model by adding something like:

trainLearner.fcregr.ets = function(.learner, .task, .subset, .weights = NULL, ...) {

  data = getTaskData(.task, .subset, target.extra = TRUE)
  data$target = ts(data$target, start = 1, frequency = .task$task.desc$frequency)
  pred.type = getLearnerPredictType(.learner)
  # Check whether this is an online model and whether it is the first iteration
  if (pred.type == "online") {
    if (!is.null(rin$update.mod)) {
      # later iterations: reuse the previously saved model
      mod = forecast::ets(y = data$target, model = rin$update.mod, ...)
    } else {
      # first iteration: fit from scratch
      mod = forecast::ets(y = data$target, ...)
    }
    # IMP: save the latest model to rin so the next iteration can use it
    rin$update.mod <<- mod
  } else {
    mod = forecast::ets(y = data$target, ...)
  }
  return(mod)
}

Is that how lexical scoping would work here? I'm assuming that, when run sequentially, R will carry rin$update.mod over to the next iteration, so the next iteration has the updated model.

Though this is not great. For instance, this could cause problems when doing parallelization at the resampling level.
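On the lexical scoping question, a small base R experiment may help. It is only a sketch with made-up names (`make_trainer`, `caller`, `state`), not mlr code. `<<-` looks up and assigns the variable in the environments enclosing the place where the function was defined, not in the frame of whoever calls it. So an object like rin that lives inside the calling function would not be modified by `<<-` from a train function defined elsewhere.

```r
# 'state' here lives in the environment where the inner functions are DEFINED.
make_trainer <- function() {
  state <- list(update.mod = NULL)
  list(
    train = function(mod) {
      state$update.mod <<- mod  # writes to the enclosing 'state'
      invisible(NULL)
    },
    get = function() state$update.mod
  )
}

# The caller has its own local 'state' with the same name (on the call stack).
caller <- function(tr) {
  state <- list(update.mod = "caller-local")
  tr$train("fitted model")
  state$update.mod  # the caller's copy is untouched
}

tr <- make_trainer()
local_after <- caller(tr)     # "caller-local"
enclosed_after <- tr$get()    # "fitted model"
```

So for the idea above to work, the saved model would have to live somewhere the train function can see lexically (or be passed in explicitly), and it still would not be shared across separate worker processes.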