hyperband and non-linear runtime scale

berndbischl commented 4 years ago

i am pretty sure, all of the HB formulas dont hold anymore if the tuned algo does not scale linear in runtime with eta. this is especially true if we use the subsampling trick. we at least need to document this, but also maybe adapt a bit.

this is a more complicated issue, and needs to be discussed in the team of course, would be great if people already post observations and thoughts here

SebGGruber commented 4 years ago

Documenting appropriately is a no-brainer - I already did that today. But maybe give an additional argument specifying the runtime complexity of the learner w.r.t. the chosen budget in big-o notation and update the HB formula accordingly, so the first and the last bracket require approximately the same runtime again.

SebGGruber commented 4 years ago

I had some further thoughts about it and if we simply apply the inverse of the complexity on each budget, we should end up with the same budget across all brackets again. A simple example for O(f(budget)) with f(budget) = log(budget) :

The inverse then would be g := f^(-1) = exp

Assume we are using R = 2 and eta = 2, the brackets layout would look like this:

 bracket | bracket_stage | budget | n.configs | "real" runtime (ignoring constants) = f(budget)
    1             1           1          2          log(1) = 0
    1             2           2          1          log(2) = 0.693
    2             1           2          2          log(2) = 0.693

In this example, the usual hyperband algorithm always takes a lot longer to run (theoretically) in the last bracket compared to the first ones. How to fix this? Apply the inverse of the complexity function on each budget in each bracket stage:

 bracket | bracket_stage | budget = g(budget_old) | n.configs | "real" runtime (ignoring constants)
   1              1            exp(1) = 0               2           log(0) = 1
   1              2            exp(2) = 7.39            1           log(7.39) = 2
   2              1            exp(2) = 7.39            2           log(7.39) = 2

Now, the sum of the budgets is (approximately) equal across all brackets, again.

The only issue I can see here is, that the user specifies R = 2, but the algorithm then uses a budget of 7.39 for the last bracket stage. This is misleading? But at least the budgets are scaled correctly

SebGGruber commented 4 years ago

should work with the param trafo. I added an example to the "examples" section of the class doc. It's a dumb example, but it shows how it would work and I didnt know better

set.seed(123)
ps = ParamSet$new(list(
  ParamInt$new("nrounds", lower = 1, upper = 10, tags = "budget"),
  ParamDbl$new("eta",     lower = 0, upper = 1),
  ParamFct$new("booster", levels = c("gbtree", "gblinear", "dart"))
))

ps$trafo = function(x, param_set) {
  x$nrounds = round(log(x$nrounds)) + 1L
  return(x)
}

inst = TuningInstance$new(
   tsk("iris"),
   lrn("classif.xgboost"),
   rsmp("holdout"),
   msr("classif.ce"),
   ps,
   term("evals", n_evals = 100000)
)
tuner = TunerHyperband$new(eta = 2)
tuner$tune(inst)
inst$best()

juliambr commented 4 years ago

From a theoretical perspective, I think this issue is not a problem. The only assumption the hyperband paper makes is that the algorithm performance converges with a budget parameter going to infty.

"Adapting for convergence rates" is another topic, and is also given as outlook in the original hyperband paper. The issue can be closed IMO, but another interesting research question can be opened.

mlr-org / mlr3hyperband

hyperband and non-linear runtime scale #13