Hi, welcome!
I am trying to be precise, so I can make sure that I understand the question correctly.
You are currently fitting 4 AutoTuners, each with a budget of n.
For the ensemble, you are spreading the budget over 3 methods, so each method gets only 1/3 of the budget. In comparison, for the normal AutoTuners, each model gets the full budget. The comparison is therefore not completely fair. Additionally, the three models that yield the best ensemble do not necessarily have to be the best individual models!
What you ask is nonetheless possible, although a little hacky.
# .... continuing code above.
learners = list(ensemble_tuner)
design = benchmark_grid(task, learners, outer_hold) # this step runs tuning of the pipeline.
# Set store_models to TRUE
bmr <- benchmark(design, store_models=TRUE)
# Extract the "learner pipeops" from the graph
ops = bmr$learners$learner[[1]]$learner$graph$pipeops[1:3]
# Resample only the outer fitting procedure (we do not tune again)
# Then convert to benchmark result and append to the bmr
lapply(ops, function(ll) {
  res = resample(task, ll, bmr$resamplings$resampling[[1]])
  bmr$combine(as_benchmark_result(res))
})
bmr
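Once the three single-learner resample results have been combined, the tuned pipeline and its member models can be compared side by side. A small usage sketch, assuming the code above was run and classification accuracy is the measure of interest:
# Aggregate all (now 4) resample results by accuracy
bmr$aggregate(msr("classif.acc"))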
Hi @pfistfl, thanks for the reply.
I was not aware that the three models within the ensemble are not getting the full budget; I am still getting used to the pipelines and was not aware of this behaviour. I apologize in advance if this seems like a stupid question, but how can I allocate the full budget to each of the 3 methods in the ensemble model above? Also, would it then be a fair comparison if I used the method you've just provided to compare the performance of the ensemble model against the 3 isolated models?
Thank you!
So for the ensemble AutoTuner you allocate a fixed budget of e.g. 300 seconds. It can therefore only spend 300 seconds in total on finding a good combination of hyperparameters for all 3 methods. This will also not be distributed evenly, as different models take different amounts of time. A single-model AutoTuner, on the other hand, can spend the full 300 seconds on just one learner.
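If you want the budget to be explicit wall-clock time rather than a number of evaluations, you could give every AutoTuner the same run-time terminator. A minimal sketch, assuming the ens_lrn, glmnet_lrn, ps_ens and ps_glmnet objects from the original post (the remaining tuners would be set up analogously):
library(mlr3tuning)
# 300 seconds for every AutoTuner: the ensemble tuner must split this time across
# all three member models, a single-learner tuner spends it all on one learner.
time_budget = trm("run_time", secs = 300)
ensemble_tuner = AutoTuner$new(learner = ens_lrn, resampling = rsmp("cv", folds = 10),
  measure = msr("classif.acc"), search_space = ps_ens,
  terminator = time_budget, tuner = tnr("grid_search"))
glmnet_tuner = AutoTuner$new(learner = glmnet_lrn, resampling = rsmp("cv", folds = 10),
  measure = msr("classif.acc"), search_space = ps_glmnet,
  terminator = time_budget, tuner = tnr("grid_search"))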
Also keep in mind:
Additionally, the three models that yield the best ensemble do not necessarily have to be the best individual models!
I am not sure which question you are trying to answer:
Given a fixed budget, what is better: tuning a single model or tuning a full ensemble?
-> In this case, there is not really a way around re-fitting all models.
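Concretely, that would mean keeping all four AutoTuners (with identical terminators, e.g. the run-time terminator sketched above) in one benchmark design and letting each of them tune from scratch. A short sketch, assuming the tuners from the original post:
design = benchmark_grid(task, list(ensemble_tuner, glmnet_tuner, logreg_tuner, rf_tuner), rsmp("holdout"))
bmr = benchmark(design, store_models = TRUE)  # re-fits and tunes every model under the same budget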
@pfistfl I understand now, thanks for the help! Your advice and comments are much appreciated!
Hi mlr3 community,
I'm trying to create a benchmark to compare the results of glmnet, random forest and logistic regression against a combination of the three models in an ensemble. Below is what I've got thus far:
Code
Initiate Task & Learner ----
task = TaskClassif$new(id = "timeseries",backend = data, target = "CHANGE_next", positive = '1')
Set date as a id
task$set_col_roles('date', role = 'name')
Initiate the three learner of the pipeline
glmnet_lrn = lrn("classif.glmnet", predict_type = "prob") logreg_lrn = lrn("classif.log_reg", predict_type = "prob") rf_lrn = lrn("classif.ranger", predict_type = "prob")
Specify alpha = 0.5 for glmnet model & tree = 500 for ranger model
glmnet_lrn$param_set$values$alpha = 0.5 rf_lrn$param_set$values$num.trees = 500
To create po object for ensemble model
glmnet_cv = PipeOpLearner$new(glmnet_lrn$clone(), id = "Multi") logreg_cv = PipeOpLearner$new(logreg_lrn$clone(), id = "LR") rf_cv = PipeOpLearner$new( rf_lrn$clone(), id = "RF")
Summarize to ensemble model
ensemble = gunion(list(glmnet_cv, logreg_cv, rf_cv)) %>>% PipeOpClassifAvg$new(innum = 3L)
ensemble$plot(html = FALSE)
ens_lrn = GraphLearner$new(ensemble) ens_lrn$predict_type = "prob"
Hyper Parameter Tuning Settings
Ensemble
ps_ens = ParamSet$new( list( ParamDbl$new("Multi.lambda.min.ratio", lower = 0.001, upper = 0.999), ParamDbl$new("LR.epsilon", lower = 0.001, upper = 0.999), ParamInt$new("RF.mtry", lower = 1L, upper = 5L)))
Glmnet
ps_glmnet = ParamSet$new( list( ParamDbl$new("lambda.min.ratio", lower = 0.001, upper = 0.999)))
Logreg
ps_logreg = ParamSet$new( list( ParamDbl$new("epsilon", lower = 0.001, upper = 0.999)))
RandomForest
ps_ranger = ParamSet$new( list( ParamInt$new("mtry", lower = 1L, upper = 5L)))
Specify modelling parameter
cv = rsmp("cv", folds = 10) measure = msr("classif.acc") terminator = trm("evals", n_evals = 1) #To limit running time, run each model once tuner = tnr('grid_search')
AutoTuner inclusive of in-sample cross validation within train set
ensemble_tuner = AutoTuner$new( learner = ens_lrn, resampling = cv, measure = measure, search_space = ps_ens, terminator = terminator, tuner = tuner) glmnet_tuner = AutoTuner$new( learner = glmnet_lrn, resampling = cv, measure = measure, search_space = ps_glmnet, terminator = terminator, # to limit running time tuner = tuner )
logreg_tuner = AutoTuner$new( learner = logreg_lrn, resampling = cv, measure = measure, search_space = ps_logreg, terminator = terminator, # to limit running time tuner = tuner )
rf_tuner = AutoTuner$new( learner = rf_lrn, resampling = cv, measure = measure, search_space = ps_ranger, terminator = terminator, # to limit running time tuner = tuner)
Specify holdout outer sampling on test set
outer_hold = rsmp("holdout")
Run Benchmark Model
learners = list(ensemble_tuner,glmnet_tuner, logreg_tuner,rf_tuner) design = benchmark_grid(task, learners, outer_hold) bmr <- benchmark(design)
As shown above, I currently run each model once inside the ensemble (3 models) and then once each for the isolated comparison (3 models). So in total, I am running 6 models to get 4 resampling results. My question is: is it possible to branch out the train set of each learner in the ensemble model so that I only need to train 3 learners instead of 6?
Thank you for the help in advance!