mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Enquiries on branching of trained learners from ensemble model #557

Closed guo5hengg closed 3 years ago

guo5hengg commented 3 years ago

Hi mlr3 community,

I'm trying to set up a benchmark to compare the results of glmnet, random forest, and logistic regression against a combination of the three models in an ensemble. Below is what I've got thus far:

Code


# Initiate Task & Learners ----
task = TaskClassif$new(id = "timeseries", backend = data, target = "CHANGE_next", positive = '1')

# Set date as an id
task$set_col_roles('date', role = 'name')

# Initiate the three learners of the pipeline
glmnet_lrn = lrn("classif.glmnet", predict_type = "prob")
logreg_lrn = lrn("classif.log_reg", predict_type = "prob")
rf_lrn = lrn("classif.ranger", predict_type = "prob")

# Specify alpha = 0.5 for the glmnet model & num.trees = 500 for the ranger model
glmnet_lrn$param_set$values$alpha = 0.5
rf_lrn$param_set$values$num.trees = 500

# Create PipeOpLearner objects for the ensemble model
glmnet_cv = PipeOpLearner$new(glmnet_lrn$clone(), id = "Multi")
logreg_cv = PipeOpLearner$new(logreg_lrn$clone(), id = "LR")
rf_cv = PipeOpLearner$new(rf_lrn$clone(), id = "RF")

# Combine into the ensemble model
ensemble = gunion(list(glmnet_cv, logreg_cv, rf_cv)) %>>% PipeOpClassifAvg$new(innum = 3L)

ensemble$plot(html = FALSE)

ens_lrn = GraphLearner$new(ensemble)
ens_lrn$predict_type = "prob"

# Hyperparameter tuning settings ----

# Ensemble
ps_ens = ParamSet$new(list(
  ParamDbl$new("Multi.lambda.min.ratio", lower = 0.001, upper = 0.999),
  ParamDbl$new("LR.epsilon", lower = 0.001, upper = 0.999),
  ParamInt$new("RF.mtry", lower = 1L, upper = 5L)
))

# Glmnet
ps_glmnet = ParamSet$new(list(
  ParamDbl$new("lambda.min.ratio", lower = 0.001, upper = 0.999)
))

# Logreg
ps_logreg = ParamSet$new(list(
  ParamDbl$new("epsilon", lower = 0.001, upper = 0.999)
))

# Random forest
ps_ranger = ParamSet$new(list(
  ParamInt$new("mtry", lower = 1L, upper = 5L)
))

# Specify modelling parameters
cv = rsmp("cv", folds = 10)
measure = msr("classif.acc")
terminator = trm("evals", n_evals = 1)  # to limit running time, run each model once
tuner = tnr('grid_search')

# AutoTuners, including in-sample cross-validation within the train set
ensemble_tuner = AutoTuner$new(learner = ens_lrn, resampling = cv, measure = measure,
  search_space = ps_ens, terminator = terminator, tuner = tuner)

glmnet_tuner = AutoTuner$new(learner = glmnet_lrn, resampling = cv, measure = measure,
  search_space = ps_glmnet, terminator = terminator, tuner = tuner)

logreg_tuner = AutoTuner$new(learner = logreg_lrn, resampling = cv, measure = measure,
  search_space = ps_logreg, terminator = terminator, tuner = tuner)

rf_tuner = AutoTuner$new(learner = rf_lrn, resampling = cv, measure = measure,
  search_space = ps_ranger, terminator = terminator, tuner = tuner)

# Specify holdout outer resampling on the test set
outer_hold = rsmp("holdout")

# Run the benchmark
learners = list(ensemble_tuner, glmnet_tuner, logreg_tuner, rf_tuner)
design = benchmark_grid(task, learners, outer_hold)
bmr <- benchmark(design)


As shown above, I'm currently training each model once inside the ensemble (3 models) and once each for the isolated comparison (3 models), so in total I'm fitting 6 models to obtain 4 resampling results. My question is: is it possible to branch off the trained learners inside the ensemble model so that I only need to train 3 learners instead of 6?

Thank you for the help in advance!

pfistfl commented 3 years ago

Hi, welcome!

I am trying to be precise, so I can make sure that I understand the question correctly. You are currently fitting 4 AutoTuners with a budget of n.

For the ensemble, you are spreading the budget over 3 methods, so each method effectively gets only 1/3 of the budget. In comparison, each of the standalone AutoTuners gets the full budget, so the comparison is not completely fair. Additionally, the three models that yield the best ensemble are not necessarily the best individual models!
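To make the budget point concrete, here is a minimal sketch (the evaluation count of 30 is an illustrative assumption, not a number from your code): one evaluation of the ensemble AutoTuner fixes a single joint configuration and fits all three learners, so the same terminator has to cover a 3-dimensional search space for the ensemble but only a 1-dimensional one for each standalone tuner.

# Illustrative only: the same terminator yields very different per-parameter budgets.
terminator = trm("evals", n_evals = 30)

ps_ens$ids()      # "Multi.lambda.min.ratio", "LR.epsilon", "RF.mtry" -> 30 evals over 3 dimensions
ps_glmnet$ids()   # "lambda.min.ratio"                                -> 30 evals over 1 dimension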

What you ask is nonetheless possible, although a little hacky.

# .... continuing code above.
learners = list(ensemble_tuner)
design = benchmark_grid(task, learners, outer_hold) # this step runs tuning of the pipeline.

# Set store_models to TRUE so the fitted pipelines are kept in the result
bmr <- benchmark(design, store_models=TRUE)

# Extract the "learner pipeops" from the graph
ops = bmr$learners$learner[[1]]$learner$graph$pipeops[1:3]
# Resample only the outer fitting procedure (we do not tune again)
# Then convert to benchmark result and append to the bmr
lapply(ops, function(ll) {
  res = resample(task, ll, bmr$resamplings$resampling[[1]])
  bmr$combine(as_benchmark_result(res))
})

bmr
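As a small follow-up one might run afterwards (a sketch; classif.acc is the measure already defined in your code above), the combined result can be aggregated to compare the tuned ensemble against the three extracted learners:

# Aggregate the 4 resample results (tuned ensemble + 3 extracted learners)
# on the accuracy measure used above.
bmr$aggregate(msr("classif.acc"))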
guo5hengg commented 3 years ago

Hi @pfistfl, thanks for the reply.

I was not aware that the three models within the ensemble are not each getting the full tuning budget; I am still getting used to the pipelines and was not aware of this. I apologize in advance if this seems like a stupid question, but how can I allocate the full budget to each of the 3 methods in the ensemble model above? Also, would it then be a fair comparison if I were to use the method you've just provided to compare the performance of the ensemble model against the 3 isolated models?

Thank you!

pfistfl commented 3 years ago

So for the ensemble AutoTuner you allocate a fixed budget of e.g. 300 seconds. It can therefore only spend 300 seconds in total on finding a good combination of hyperparameters for all 3 methods. This time will also not be distributed evenly, since different models take different amounts of time to fit. A single-model AutoTuner, on the other hand, can spend the full 300 seconds on just one learner.
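A minimal sketch of what that looks like in code, assuming the run-time terminator from bbotk/mlr3tuning and keeping 300 seconds as the example figure from above:

# Illustrative only: give every AutoTuner the same wall-clock budget.
# The ensemble tuner has to split these 300 seconds across glmnet, log_reg
# and ranger, while each standalone tuner spends all of them on a single learner.
terminator = trm("run_time", secs = 300)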

Also keep in mind:

Additionally, the three models that yield the best ensemble do not necessarily have to be the best individual models!

I am not sure which question you are trying to answer:

Given a fixed budget, what is better: tuning a single model or tuning a full ensemble? -> In this case, there is not really a way around re-fitting all models.

guo5hengg commented 3 years ago

@pfistfl I understand now, thanks for the help!! Your advice and comments are much appreciated!