mlr-org / mlr3mbo

Flexible Bayesian Optimization in R
https://mlr3mbo.mlr-org.com

Accessing fitted GP from Optimisation #131

Closed dfalster closed 3 weeks ago

dfalster commented 10 months ago

Hi team, thanks for your great work on this collection of packages. I am a researcher in Australia using GPs to assist with deterministic models of forest dynamics, and I am interested in fitting GPs to find maxima using Bayesian optimisation.

I have a question about Bayesian optimisation: is it possible to access the fitted surrogate after the optimisation step?

For example, in the simple optimisation example in your readme, once we've finished the optimisation procedure, can we use the surrogate developed in the background? I have only been able to find the predicted optimum and the archive of points where the objective function was evaluated.

For example, how would I plot the fitted GP against the original function, with the optimum shown?

Similarly, in the Single-Objective: 2D Schwefel Function example there are some plots, but as far as I can tell these use the original objective function, not the fitted surrogate.

Many thanks for considering this!

Daniel Falster

sumny commented 10 months ago

Hi @dfalster, thanks for your interest and for reaching out!

You can access the LearnerRegr within a Surrogate via the $learner field, for example:

library(bbotk)
library(mlr3mbo)
library(mlr3learners)
set.seed(1)

# objective: minimize y1 = x^2 over x in [-10, 10]
obfun = ObjectiveRFun$new(
  fun = function(xs) list(y1 = xs$x ^ 2),
  domain = ps(x = p_dbl(lower = -10, upper = 10)),
  codomain = ps(y1 = p_dbl(tags = "minimize")))

instance = OptimInstanceSingleCrit$new(
  objective = obfun,
  terminator = trm("evals", n_evals = 10))

# GP surrogate, expected improvement, and random search as acquisition optimizer
surrogate = srlrn(lrn("regr.km", control = list(trace = FALSE)))
acqfun = acqf("ei")
acqopt = acqo(opt("random_search", batch_size = 100),
  terminator = trm("evals", n_evals = 100))

optimizer = opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = surrogate,
  acq_function = acqfun,
  acq_optimizer = acqopt)

optimizer$optimize(instance)
surrogate
surrogate$learner
<SurrogateLearner>: LearnerRegrKM
* Parameters: assert_insample_perf=FALSE, catch_errors=TRUE
<LearnerRegrKM:regr.km>
* Model: km
* Parameters: control=<list>
* Packages: mlr3, mlr3learners, DiceKriging
* Predict Types:  response, [se]
* Feature Types: logical, integer, numeric
* Properties: -

If you have already used the surrogate to do BO (as in the example above), the learner will be in the state of the latest surrogate fitting step:

surrogate$learner$model
Call:
DiceKriging::km(design = data, response = task$truth(), control = pv$control)

Trend  coeff.:
               Estimate
 (Intercept)   256.1655

Covar. type  : matern5_2
Covar. coeff.:
               Estimate
    theta(x)    30.0000

Variance estimate: 35104.96

Note that you can also access the surrogate from the optimizer via its $surrogate field, i.e., we can equivalently do:

optimizer$surrogate$learner$model

Now you can in principle use either the $learner within the surrogate or the surrogate class directly to make new predictions. For example:

library(data.table)
library(ggplot2)

xdt = data.table(x = seq(-10, 10, length.out = 101))
y1 = obfun$eval_dt(xdt)$y1
surrogate_pred = surrogate$predict(xdt)
xdt[, "y1" := y1]
xdt[, c("mean_prediction", "se_prediction") := surrogate_pred]

# plot true function in black
# surrogate prediction (mean +- se in grey)
# known optimum in darkred
# found optimum in darkgreen
ggplot(aes(x = x, y = y1), data = xdt) +
  geom_line() +
  geom_line(aes(x = x, y = mean_prediction), colour = "grey",  data = xdt) +
  geom_ribbon(aes(x = x, ymin = mean_prediction - se_prediction, ymax = mean_prediction + se_prediction), fill = "grey", alpha = 0.2) +
  annotate("point", x = 0, y = 0, colour = "darkred") +
  geom_point(aes(x = x, y = y1), data = instance$result, color = "darkgreen") +
  theme_minimal()

[Plot: true function (black), surrogate mean ± se (grey), known optimum (dark red), found optimum (dark green)]
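
For completeness, here is a minimal sketch of the other route, predicting with the fitted learner itself via mlr3's predict_newdata() (the surrogate configures the learner to predict standard errors, as the "[se]" in the printed output above indicates, so the prediction also carries them):

# predict with the fitted GP learner directly; returns a PredictionRegr
pred = surrogate$learner$predict_newdata(data.table(x = c(-5, 0, 5)))
pred$response  # posterior mean
pred$se        # posterior standard error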

I hope this answered your questions!

(Also: our book now contains a larger section on BO, https://mlr3book.mlr-org.com/chapters/chapter5/advanced_tuning_methods_and_black_box_optimization.html#sec-bayesian-optimization, which may be more detailed than the current vignettes - but I will update them soon.)

dfalster commented 10 months ago

Hi @sumny, thanks so much for this thorough and helpful explanation - it makes sense. I had looked at the surrogate but hadn't realised it contained the updated GP.

I'd suggest updating your example to illustrate use of the surrogate after the optimisation is complete.

As a follow-up: the example focusses on optimisation. What if you just wanted to build a surrogate, e.g. one with uncertainty below some threshold? This could then be used for optimisation, or for other purposes (e.g. visualisation of the output response of a black-box model). Would I do something different from optimisation to build this surrogate?

sumny commented 10 months ago

I'd suggest updating your example to illustrate use of the surrogate after the optimisation is complete.

Will do!

As a follow-up: the example focusses on optimisation. What if you just wanted to build a surrogate, e.g. one with uncertainty below some threshold? This could then be used for optimisation, or for other purposes (e.g. visualisation of the output response of a black-box model). Would I do something different from optimisation to build this surrogate?

Assuming you just want a "good" surrogate model and want to build it iteratively, one way to do this would be to change the acquisition function in your BO loop to one that focuses on building a "good" model instead of finding highly promising candidate points for evaluation. If we agree that a "good" surrogate model is one that has low uncertainty everywhere (or in certain specified regions), then our acquisition function (proposing the next candidate point for evaluation) could simply be the posterior predictive variance of the surrogate model - which we then want to maximize.

For example, let's assume we have the following true black-box function:

library(bbotk)
library(mlr3mbo)
library(mlr3learners)
library(data.table)
library(ggplot2)
set.seed(1)

# 1D test function: f(x) = 2x * sin(14x)
sinus_1D = function(xs) 2 * xs$x * sin(14 * xs$x)

domain = ps(x = p_dbl(lower = 0, upper = 1))
codomain = ps(y = p_dbl(tags = "minimize"))
objective = ObjectiveRFun$new(sinus_1D,
  domain = domain, codomain = codomain)

xydt = generate_design_grid(domain, resolution = 1001)$data
xydt[, "y" := objective$eval_dt(xydt)$y]

ggplot(aes(x = x, y = y), data = xydt) +
  geom_line() +
  theme_minimal()

[Plot: the true function sinus_1D on [0, 1]]

Now assume we have evaluated some initial design on the optim instance:

instance = OptimInstanceSingleCrit$new(objective, search_space = domain, terminator = trm("evals", n_evals = 20))
design = data.table(x = c(0.10, 0.31, 0.5, 0.59))
instance$eval_batch(design)

Then a GP fitted on this initial design will look like the following:

lrn_gp = lrn("regr.km", covtype = "matern5_2", optim.method = "BFGS", control = list(trace = FALSE))
surrogate = srlrn(lrn_gp, archive = instance$archive)
surrogate$update()
xydt[, c("mean", "se") :=  surrogate$predict(xydt[, "x"])]
ggplot(xydt, mapping = aes(x = x, y = y)) +
  geom_point(size = 2, data = instance$archive$data) +
  geom_line() +
  geom_line(aes(y = mean), colour = "gray", linetype = 2) +
  geom_ribbon(aes(ymin = mean - se, ymax = mean + se), fill = "gray", alpha = 0.3) +
  geom_line(aes(y = se), linewidth = 1, colour = "darkgray") +
  scale_y_continuous("y", sec.axis = sec_axis(~ ., name = "Posterior Standard Deviation")) +
  theme_minimal()

(The black line is the true function, the dashed light grey line is the surrogate mean prediction, the ribbon shows the uncertainty around the mean prediction, and the dark grey line is the posterior predictive standard deviation.)

[Plot: GP fit after the initial design of four points]

A sensible strategy for choosing the next candidate point for evaluation could then simply be to find the point with the highest posterior predictive variance (i.e., the highest surrogate model uncertainty), because the model will have less uncertainty in that region after we have evaluated that point.

In the figure this would be around x = 1 (where the dark grey line reaches a value of around 0.631).
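
We can also read this candidate off programmatically from the grid predictions (a small sketch using the xydt table from above):

# grid point with the highest posterior predictive standard deviation
xydt[which.max(se), .(x, se)]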

We would then evaluate this point and update the surrogate model.

instance$eval_batch(data.table(x = 1.0))
surrogate$update()
xydt[, c("mean", "se") :=  surrogate$predict(xydt[, "x"])]
ggplot(xydt, mapping = aes(x = x, y = y)) +
  geom_point(size = 2, data = instance$archive$data) +
  geom_line() +
  geom_line(aes(y = mean), colour = "gray", linetype = 2) +
  geom_ribbon(aes(ymin = mean - se, ymax = mean + se), fill = "gray", alpha = 0.3) +
  geom_line(aes(y = se), linewidth = 1, colour = "darkgray") +
  scale_y_continuous("y", sec.axis = sec_axis(~ ., name = "Posterior Standard Deviation")) +
  theme_minimal()

[Plot: GP fit after additionally evaluating x = 1]

The next candidate based on the highest uncertainty would now probably be around x = 0.81.
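
Programmatically, the refreshed grid predictions should point at roughly the same location:

xydt[which.max(se), x]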

So in essence we can just do "normal" BO with a different acquisition function, because we mostly care about reducing the uncertainty of the model. To use the uncertainty of the surrogate model as the acquisition function, we can use:

acq_function = acqf("sd")

Now we can run the "optimization" as we normally would (on a side note: we should probably use a more sophisticated acquisition function optimizer, but for illustrative purposes we just use the cheap default):

optimizer = opt("mbo", loop_function = bayesopt_ego, surrogate = surrogate, acq_function = acq_function)
optimizer$optimize(instance)

The final archive then looks like:

               x             y  x_domain           timestamp batch_nr     acq_sd .already_evaluated
 1: 0.1000000000  1.970899e-01 <list[1]> 2023-11-01 12:02:27        1         NA                 NA
 2: 0.3100000000 -5.775057e-01 <list[1]> 2023-11-01 12:02:27        1         NA                 NA
 3: 0.5000000000  6.569866e-01 <list[1]> 2023-11-01 12:02:27        1         NA                 NA
 4: 0.5900000000  1.084067e+00 <list[1]> 2023-11-01 12:02:27        1         NA                 NA
 5: 1.0000000000  1.981215e+00 <list[1]> 2023-11-01 12:12:40        2         NA                 NA
 6: 0.8105923343 -1.521380e+00 <list[1]> 2023-11-01 12:23:43        3 0.52512337              FALSE
 7: 0.0002707751  2.052931e-06 <list[1]> 2023-11-01 12:23:45        4 1.11439774              FALSE
 8: 0.7006544482 -5.254747e-01 <list[1]> 2023-11-01 12:23:46        5 1.02513441              FALSE
 9: 0.2058567775  1.056835e-01 <list[1]> 2023-11-01 12:23:48        6 0.89366871              FALSE
10: 0.6837159549 -2.006213e-01 <list[1]> 2023-11-01 12:23:50        7 0.95964693              FALSE
11: 0.9104040253  3.246989e-01 <list[1]> 2023-11-01 12:23:52        8 0.49034916              FALSE
12: 0.4037153262 -4.764570e-01 <list[1]> 2023-11-01 12:23:54        9 0.38552672              FALSE
13: 0.0470159336  5.752043e-02 <list[1]> 2023-11-01 12:23:55       10 0.13293200              FALSE
14: 0.2578149722 -2.325176e-01 <list[1]> 2023-11-01 12:23:57       11 0.10138293              FALSE
15: 0.9586878328  1.447122e+00 <list[1]> 2023-11-01 12:23:59       12 0.08685817              FALSE
16: 0.4519821387  4.027150e-02 <list[1]> 2023-11-01 12:24:01       13 0.07853812              FALSE
17: 0.7623390332 -1.445916e+00 <list[1]> 2023-11-01 12:24:03       14 0.06997645              FALSE
18: 0.1527324831  2.575894e-01 <list[1]> 2023-11-01 12:24:04       15 0.06204140              FALSE
19: 0.8606257185 -8.516840e-01 <list[1]> 2023-11-01 12:24:06       16 0.04969792              FALSE
20: 0.5472359485  1.074219e+00 <list[1]> 2023-11-01 12:24:08       17 0.04314195              FALSE
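
(For reference, this table is simply the contents of the archive, which you can print at any time:)

instance$archive$data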

The final surrogate then looks like:

surrogate$update()
xydt[, c("mean", "se") :=  surrogate$predict(xydt[, "x"])]
ggplot(xydt, mapping = aes(x = x, y = y)) +
  geom_point(size = 2, data = instance$archive$data) +
  geom_line() +
  geom_line(aes(y = mean), colour = "gray", linetype = 2) +
  geom_ribbon(aes(ymin = mean - se, ymax = mean + se), fill = "gray", alpha = 0.3) +
  theme_minimal()

[Plot: final GP fit after all 20 evaluations]

This is probably quite decent with respect to overall uncertainty.
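
If you prefer a number over a visual impression, you can summarize the remaining uncertainty over the prediction grid (again a sketch based on the xydt table from above):

# worst-case and average posterior standard deviation across the grid
xydt[, .(max_se = max(se), mean_se = mean(se))]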

However, note that we simply ran the "optimization" until 20 evaluations were reached, based on the terminator we specified when constructing the instance (trm("evals", n_evals = 20)).

You mentioned:

with uncertainty below some threshold

It would probably make sense to use a custom Terminator here, but I am afraid I cannot directly provide one in mlr3mbo at the moment.

As a draft, you could look at the following:

library(R6)
library(bbotk)
library(checkmate)
library(mlr3misc)

TerminatorUncertaintyLevelReached = R6Class("TerminatorUncertaintyLevelReached",
  inherit = Terminator,
  public = list(

    #' @description
    #' Creates a new instance of this [R6][R6::R6Class] class.
    initialize = function() {
      param_set = ps(
        level = p_dbl(tags = "required", default = 0.1)
      )
      param_set$values = list(level = 0.1)
      super$initialize(
        id = "uncertainty_level_reached",
        param_set = param_set,
        properties = "single-crit",
        label = "Uncertainty Level",
        man = "mlr_terminators_uncertainty_level_reached"
      )
    },

    #' @description
    #' Is `TRUE` iff the termination criterion is positive, and `FALSE`
    #' otherwise.
    #'
    #' @return `logical(1)`.
    is_terminated = function(archive) {
      assert_r6(archive, "Archive")
      level = self$param_set$values$level

      # quite hacky
      if (archive$n_evals == 0L || "acq_sd" %nin% colnames(archive$data)) {
        return(FALSE)
      }

      sddata = archive$data[["acq_sd"]]
      any(sddata <= level, na.rm = TRUE)
    }
  )
)

This terminator could then be used within the instance:

terminator = TerminatorUncertaintyLevelReached$new()
terminator$param_set$values$level = 0.01  # specify threshold
instance = OptimInstanceSingleCrit$new(objective, search_space = domain, terminator = terminator)
optimizer$optimize(instance)

Optimization then continues as long as acq_sd is above the specified 0.01 level.
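
If you want to guard against the level never being reached, a minimal sketch would be to combine the custom terminator with an evaluation budget via bbotk's TerminatorCombo (which by default terminates as soon as any of its criteria fires):

terminator_sd = TerminatorUncertaintyLevelReached$new()
terminator_sd$param_set$values$level = 0.01
# stop when the uncertainty level is reached OR after 50 evaluations
terminator = TerminatorCombo$new(list(terminator_sd, trm("evals", n_evals = 50)))
instance = OptimInstanceSingleCrit$new(objective, search_space = domain, terminator = terminator)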

dfalster commented 10 months ago

Hi @sumny, thank you so much for this. I really appreciate the time you spent on this post - thanks for your generosity.

BTW - Your example could also easily be added as a chapter in the book :).

Your suggestion to change the acquisition function and continue using bayesopt is perfect. I tried this and, as expected, the algorithm spends more time resolving the function across the whole domain rather than around the maximum, as shown in the plots below.

In both cases I seeded with an initial design of 11 points, then let the bayesopt algorithm choose the next 9 (terminating after 20 evaluations). I love that a single line changes the behaviour.

optimizer <- opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = srlrn(lrn("regr.km", covtype = "matern3_2", optim.method = "BFGS",control = list(trace = FALSE))),
  acq_function = acqf("sd"),
  acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
    terminator = trm("stagnation", iters = 100, threshold = 1e-5))
  )

# f() is my own black-box objective; p and seq_range() also come from my own analysis code
obfun <- ObjectiveRFun$new(
  fun = function(xs) list(y = f(xs$x)),
  domain = ps(x = p_dbl(lower = log(0.01), upper = log(2))),
  codomain = ps(y = p_dbl(tags = "maximize"))
)

initial_design <- data.table(x = sort(c(log(p$strategies[[1]]$lma), seq_range(log(c(0.01, 2)), 10))))

instance <- OptimInstanceSingleCrit$new(
  objective = obfun,
  terminator = trm("evals", n_evals = 20)
)
instance$eval_batch(initial_design)

optimizer$optimize(instance)

# make predictions
xdt <- data.table(x = seq_range(log(c(0.01, 2)), length.out = 101))
surrogate_pred <- optimizer$surrogate$predict(xdt)
xdt[, c("y", "se") := surrogate_pred]

# plot true function in red, surrogate prediction (mean +- se) in black/grey
# (x and y below are grids from my own analysis; uses dplyr, tibble, ggplot2)
instance$archive$data %>%
  as_tibble() %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = tibble(x = log(x), y), col = "red") +
  geom_line(data = xdt, col = "black") +
  geom_ribbon(data = xdt, aes(x = x, ymin = y - se, ymax = y + se), fill = "grey", alpha = 0.2) +
  theme_minimal()

Using acqf("ei") - focusses on maximum

(red is true function, black is surrogate)

[Plot: run with acqf("ei"); evaluations concentrate around the maximum]

Using acqf("sd") - focusses on reducing sd everywhere

[Plot: run with acqf("sd"); evaluations spread across the domain]

Success!!

Odd behaviour and a solution

However, you'll notice an oddity in the second plot. I'll explain it here as it might be a small bug.

Strangely, there's a point near the top left of the broad ridge in the second plot where the GP has diverged a long way from the evaluated point. I've seen this a few times now, and I thought it was odd behaviour, given the GP is meant to be an interpolator. I've been trying to figure out what's going on and think I've reached a conclusion.

To verify, I refit the learner to the exact same data, and interestingly it no longer diverges.

The code and plot below follow on from the code above, so we use the same data collected during the optimisation and refit a new learner of the exact same type, then plot. Black is the original learner showing the divergence; green is the new learner, which now goes through the point (in blue), as expected.


new_surrogate <- srlrn(lrn("regr.km", covtype = "matern3_2", optim.method = "BFGS", control = list(trace = FALSE)), archive = instance$archive)

new_surrogate$update()
new_surrogate$learner$model

new_surrogate_pred <- new_surrogate$predict(xdt)
xdt[, c("y2", "se") := new_surrogate_pred]

instance$archive$data %>%
  as_tibble() %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = tibble(x = log(x), y), col = "red") +
  geom_line(data = xdt, col = "black") +
  geom_line(data = xdt, aes(y = y2), col = "green") +
  geom_point(data = instance$archive$data %>% as_tibble() %>% filter(batch_nr == 10), col = "blue") +
  theme_minimal()

[Plot: original surrogate (black) diverging from the last evaluated point (blue); refitted surrogate (green) interpolating it]

It turns out the blue point was the last one added (i.e. the one with the highest batch_nr). So after the optimisation, is it possible the learner hasn't been updated to reflect the last point added?

To check this I added an extra call to optimizer$surrogate$update() after the optimisation step, and voila, problem gone.
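
In code, the workaround is a single extra line after the optimisation finishes:

optimizer$optimize(instance)
# refit the surrogate on the full archive, including the last evaluated point
optimizer$surrogate$update()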

[Plot: after the extra optimizer$surrogate$update() call, the surrogate interpolates all points]

I do think it's dangerous to rely on the user to run this last update step, so I suggest the call should happen within the optimisation loop, prior to completion. Otherwise, the learner is out of sync with the data.

sumny commented 10 months ago

@dfalster thanks for the follow up!

You are correct - the surrogate used to be updated only prior to searching for a new candidate point, so after the optimization process had terminated, the surrogate was no longer "up to date" with all the data logged in the archive of the instance.

I have now changed this behavior (see #132): a final update step of the surrogate is now attempted after the optimization process has terminated (although this is not always meaningful, e.g., when using ParEGO for multi-objective optimization). In general, the surrogate state should now be more consistent with the user's expectation - thanks for bringing this up.

dfalster commented 10 months ago

Super!!!