mlr-org / mlr3mbo

Flexible Bayesian Optimization in R
https://mlr3mbo.mlr-org.com

Missing values in acq_ei #144

Closed hududed closed 4 months ago

hududed commented 4 months ago

So when I run this code:

library(mlr3mbo)
library(mlr3)
library(mlr3learners)
library(bbotk)
library(data.table)
library(tibble)

data = data.table(
  Power = c(45, 14, 66, 12, 23, 40, 56, 64, 48),
  Speed = c(49, 33, 30, 22, 46, 15, 20, 25, 12),
  DPI = c(5, 5, 7, 3, 2, 5, 6, 3, 5),
  N2gas = c(1, 0, 0, 1, 0, 1, 0, 0, 1),
  Defocus = c(-0.2, 0.2, 0.1, 0.1, -0.1, 0.1, 0, 0, 0.2),
  Resistance = c(5000000, 5000000, 5000000, 5000000, 5000000, 12.1, 3.7, 13.9, 4.6)
)
domain = ps(Power = p_int(lower = 10, upper = 70),
  Speed = p_int(lower = 10, upper = 60),
  DPI = p_int(lower = 1, upper = 7),
  N2gas = p_int(lower = 0, upper = 1),
  Defocus = p_dbl(lower = -0.3, upper = 0.3))
codomain = ps(Resistance = p_dbl(tags = "minimize"))

archive = Archive$new(search_space = domain, codomain = codomain)
archive$add_evals(xdt = data[, c("Power", "Speed", "DPI", "N2gas", "Defocus")], ydt = data[, c("Resistance")])
surrogate = srlrn(lrn("regr.ranger"), archive = archive)
acq_function = acqf("ei", surrogate = surrogate)
acq_optimizer = acqo(
  opt("focus_search", n_points = 1000, maxit = 10),
  terminator = trm("evals", n_evals = 11000),
  acq_function = acq_function)

set.seed(42)
acq_function$surrogate$update()
acq_function$update()

candidate = acq_optimizer$optimize()

I get this error:

>     candidate = acq_optimizer$optimize()
WARN  [08:49:39.398] [bbotk] Assertion on 'ydt[, self$cols_y, with = FALSE]' failed: Contains missing values (column 'acq_ei', row 369).
Error: Assertion on 'ydt[, self$cols_y, with = FALSE]' failed: Contains missing values (column 'acq_ei', row 369).
In addition: Warning messages:
1: In value[[3L]](cond) : Calibration failed with error:
Error in approx(x = calib.x, y = calib.y, xout = vars): need at least two non-NA values to interpolate
Falling back to non-calibrated variance estimates.
2: In sqrt(infjack$var.hat) : NaNs produced

I am running mlr3mbo 0.2.2.

sumny commented 4 months ago

@hududed Per default, lrn("regr.ranger") estimates standard errors via the infinitesimal jackknife (se.method = "infjack"). This can cause problems when predicting only a few (i.e., a single) points, because no bias correction is applicable (see ?predict.ranger). As OptimizerFocusSearch may evaluate points in very small batches during acquisition function optimization, this can result in the errors you are seeing. Ideally, you use the default random forest that mlr3mbo provides, default_rf(), which configures ranger to use the jackknife-after-bootstrap method for the estimation of standard errors.


library(mlr3mbo)
library(mlr3)
library(mlr3learners)
library(bbotk)
library(data.table)
library(tibble)

data = data.table(
  Power = c(45, 14, 66, 12, 23, 40, 56, 64, 48),
  Speed = c(49, 33, 30, 22, 46, 15, 20, 25, 12),
  DPI = c(5, 5, 7, 3, 2, 5, 6, 3, 5),
  N2gas = c(1, 0, 0, 1, 0, 1, 0, 0, 1),
  Defocus = c(-0.2, 0.2, 0.1, 0.1, -0.1, 0.1, 0, 0, 0.2),
  Resistance = c(5000000, 5000000, 5000000, 5000000, 5000000, 12.1, 3.7, 13.9, 4.6)
)
domain = ps(Power = p_int(lower = 10, upper = 70),
  Speed = p_int(lower = 10, upper = 60),
  DPI = p_int(lower = 1, upper = 7),
  N2gas = p_int(lower = 0, upper = 1),
  Defocus = p_dbl(lower = -0.3, upper = 0.3))
codomain = ps(Resistance = p_dbl(tags = "minimize"))

archive = Archive$new(search_space = domain, codomain = codomain)
archive$add_evals(xdt = data[, c("Power", "Speed", "DPI", "N2gas", "Defocus")], ydt = data[, c("Resistance")])

###
surrogate = srlrn(default_rf(), archive = archive)
###

acq_function = acqf("ei", surrogate = surrogate)
acq_optimizer = acqo(
  opt("focus_search", n_points = 1000, maxit = 10),
  terminator = trm("evals", n_evals = 11000),
  acq_function = acq_function)

set.seed(42)
acq_function$surrogate$update()
acq_function$update()

candidate = acq_optimizer$optimize()
> candidate
   Power Speed   DPI N2gas     Defocus  x_domain   acq_ei .already_evaluated
   <int> <int> <int> <int>       <num>    <list>    <num>             <lgcl>
1:    52    21     5     0 -0.04136798 <list[5]> 165035.6              FALSE
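
If you prefer to keep a plain lrn("regr.ranger") rather than default_rf(), the same fix can likely be applied by hand. This is a minimal sketch, assuming mlr3learners passes the se.method and keep.inbag parameters through to ranger (the exact hyperparameters default_rf() sets may differ):

```r
library(mlr3)
library(mlr3learners)
library(mlr3mbo)

# Hypothetical manual configuration: switch ranger's standard-error
# estimation from the infinitesimal jackknife ("infjack") to the
# jackknife-after-bootstrap ("jack"), in the spirit of default_rf().
learner = lrn("regr.ranger",
  predict_type = "se",   # the surrogate needs standard errors, not just means
  se.method = "jack",    # jackknife-after-bootstrap
  keep.inbag = TRUE)     # ranger requires in-bag counts for SE estimation

surrogate = srlrn(learner, archive = archive)
```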

In your case of a purely numeric / integer search space, you might also want to use a GP as surrogate (default_gp()). Hope this helps.
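
As a sketch, only the surrogate line in the script above would need to change (assuming the GP backend that default_gp() relies on, e.g. the DiceKriging-based learner from mlr3learners, is installed):

```r
# Sketch: swap the random forest surrogate for a Gaussian Process.
# default_gp() fits here because the search space is numeric/integer
# only, with no parameter dependencies.
surrogate = srlrn(default_gp(), archive = archive)
acq_function = acqf("ei", surrogate = surrogate)
```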

hududed commented 4 months ago

Ah ok, that solved it, thanks. Are default_gp and default_rf the preferred learners? I didn't see any more than those two.

sumny commented 4 months ago

These are currently the two default learners used as surrogates, see ?default_surrogate:

     For numeric-only (including integers) parameter spaces without any
     dependencies a Gaussian Process is constructed via ‘default_gp()’.
     For mixed numeric-categorical parameter spaces, or spaces with
     conditional parameters a random forest is constructed via
     ‘default_rf()’.

But this might also be extended in the future!
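
The dispatch described in that help page can also be used directly. A minimal sketch, where `instance` is a hypothetical OptimInstance you have constructed elsewhere:

```r
library(mlr3mbo)

# Sketch: let mlr3mbo choose the surrogate learner via the rule quoted
# above -- a Gaussian Process for numeric-only spaces without
# dependencies, a random forest otherwise.
# 'instance' is a hypothetical, previously constructed OptimInstance.
surrogate = default_surrogate(instance)
```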