Hi Pablo,
We are planning to add more first-class support for Pareto front optimization to Ax and BoTorch. In the meantime, it is possible to implement basic Pareto front optimization via random scalarizations of the objectives. This can be done through composite objectives; Max can clarify this further.
As a starting point, I would recommend our higher-level package, Ax. It supports constrained optimization where each outcome is modeled as a separate GP.
Ax by default uses Noisy EI, which supports batching (parallel evaluation) and multiple constraints. You can either set the noise to a small number or pass in None instead of a known SEM for the observation noise, in which case it will infer the noise. Even if your simulator is deterministic, not treating your simulator outputs as noiseless can be beneficial, since it reduces degenerate behavior in cases where the model is misspecified (e.g. the data is not perfectly modeled by a stationary kernel). (See https://arxiv.org/pdf/1007.4580)
Additional details about NEI can be found in https://arxiv.org/abs/1706.07094. In general we find that NEI with inferred noise gives nice performance relative to EI, even on deterministic functions.
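To make the Ax suggestion concrete, here is a rough sketch (not from the original discussion) of the Ax Service API with one objective, one outcome constraint, and unknown observation noise (SEM passed as None so it is inferred); all parameter and metric names are illustrative:
from ax.service.ax_client import AxClient

ax_client = AxClient()
ax_client.create_experiment(
    name="simulator_experiment",  # illustrative name
    parameters=[
        {"name": "x1", "type": "range", "bounds": [0.0, 1.0]},
        {"name": "x2", "type": "range", "bounds": [0.0, 1.0]},
    ],
    objective_name="performance",         # illustrative metric name
    minimize=False,
    outcome_constraints=["cost <= 1.0"],  # illustrative constraint
)

params, trial_index = ax_client.get_next_trial()
# ... run the (deterministic) simulation with `params` to obtain obj_val and cost_val ...
ax_client.complete_trial(
    trial_index=trial_index,
    raw_data={"performance": (obj_val, None), "cost": (cost_val, None)},  # SEM=None => noise inferred
)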
On Dec 15, 2019, at 11:58 AM, Pablo Rodriguez notifications@github.com wrote:
Hello! I recently came across BoTorch and became very interested. I have been working for some years now with surrogate techniques such as Kriging to optimize expensive black-box functions. In my case, these are expensive simulations (which take days to complete, but of which I can run several in parallel).
Brief description of my problem
Multi-objective constrained optimization with 2 to 6 continuous variables. Possibility to run ~10 evaluations in parallel (great for multipoint acquisition functions). Noise-free evaluations, because they come from a deterministic simulation. All objectives and constraints can be obtained from the same simulation results.
Question
I have got BoTorch running for some test cases, but I was wondering what the best way is to implement the following features within your framework:
1. What is the correct way to implement the Gaussian Processes with deterministic, noise-free evaluations? Should I use a FixedGaussianNoise likelihood with zero noise?
2. Have you implemented multi-objective acquisition functions that do not construct pseudo single-objective functions (i.e. not using weights)? What I mean by this is that I'm interested in techniques that work with Pareto fronts, like acquisition functions that aim at increasing the volume of the Pareto set.
3. What is the way to optimize an acquisition function subject to nonlinear constraints? As a first cut I could design a metric that accounts for the "goodness" of a solution in terms of meeting a constraint, and then use that metric in a multi-objective or weighted-objective optimization. However, maybe you have other suggestions.
Thank you very much for the great package. Looking forward to contributing in the future.
Hi Eytan,
Thank you for your quick answer. I will play around with the random scalarizations as a proxy for full Pareto front optimization in the meantime. Great to hear that there are plans for further multi-objective optimization capabilities. And thanks for the references on adding small noise to deterministic simulations to facilitate optimization. I will read about it. I'm not a computer scientist or ML expert, but eager to learn!
Re using random scalarizations: If you want to work with analytic acquisition functions (e.g. classic EI), you can use a ScalarizedObjective along the following lines:
import torch
from botorch.acquisition import ExpectedImprovement
from botorch.acquisition.objective import ScalarizedObjective
weights = torch.rand(m)  # m = number of outcomes; can use other randomization strategies here
obj = ScalarizedObjective(weights=weights)
acqf = ExpectedImprovement(model, best_f=best_val_observed, objective=obj)
To use it with the Monte-Carlo acquisition functions (such as NEI) you can do
from botorch.acquisition import qNoisyExpectedImprovement
from botorch.acquisition.objective import LinearMCObjective
weights = torch.rand(m)  # can use other randomization strategies here
obj = LinearMCObjective(weights=weights)
acqf = qNoisyExpectedImprovement(model, X_baseline=train_X, objective=obj)  # X_baseline: previously evaluated points
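For completeness, a quick sketch (not from the thread) of optimizing the resulting acquisition function to get a parallel batch of candidates with optimize_acqf; bounds is a 2 x d tensor of box bounds on the design variables, and the q/restart/raw-sample values are just illustrative:
from botorch.optim import optimize_acqf

candidates, acq_value = optimize_acqf(
    acq_function=acqf,
    bounds=bounds,      # 2 x d tensor of lower and upper bounds (illustrative)
    q=9,                # number of points to evaluate in parallel
    num_restarts=10,    # multi-start optimization restarts (illustrative)
    raw_samples=512,    # raw candidates used to initialize the restarts (illustrative)
)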
What is the correct way to implement the Gaussian Processes with deterministic noise-free evaluations? Should I use a FixedGaussianNoise likelihood with zero noise?
If you don't want to infer the noise you can use FixedNoiseGaussianLikelihood, though I would suggest using a small value rather than zero. Otherwise you can easily run into numerical issues with the underlying stack (especially if there are repeated points, since there is no deduplication of points happening under the hood).
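For reference, a minimal sketch of the "small but nonzero fixed noise" route using a FixedNoiseGP, assuming train_X and train_Y hold the training data; the 1e-6 noise level is purely illustrative:
import torch
from botorch.models import FixedNoiseGP
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls import ExactMarginalLogLikelihood

train_Yvar = torch.full_like(train_Y, 1e-6)  # small fixed observation noise instead of exactly zero
model = FixedNoiseGP(train_X, train_Y, train_Yvar)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)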
Great to hear that there are plans for further multi-objective optimization capabilities
Indeed, Pareto optimization is on the roadmap. How many outcomes are you typically interested in?
Hi Max,
Thanks for your reply. I will try those scalarization techniques.
Indeed, Pareto optimization is on the roadmap. How many outcomes are you typically interested in?
I'm currently interested in engineering problems where one can identify clear trade-offs, like performance vs. cost, and where one seeks solutions that fall within robust operation scenarios for a given physical system. Thus I'm typically interested in problems with <4 outcomes.
I have been fairly successful in the implementation of this, but I have encountered these problems:
I'm working with a function with 2 variables and 2 outputs as a benchmark. I fit a FixedNoiseGP model to the data and create a qExpectedImprovement acquisition function with a LinearMCObjective. If I manually set the weights to [1.0, 0.0], the acquisition function should look the same as if I were to build it from the same problem but with only 1 output (the first one). However, the acquisition functions look different. It is my understanding that FixedNoiseGP should treat the outputs as independent variables (even if they come from the same training data).
Does anything come to mind that could be causing this discrepancy?
Another related question to this is... if I'm using qExpectedImprovement, it expects a best_f tensor. How should this be defined for a multi-objective problem? The dot product with the weights?
When using your fit_gpytorch_torch routine, which uses ConvergenceCriterion, the fitting process is stopped at different times depending on whether I work with the 1-output or the 2-output problem. I'm not sure exactly how it works, but should the ftol condition be different in the two cases? In other words, the model for independent output 1 (in a 2-output problem) turns out to be different from the model for a single output.
However, the acquisition functions look different.
What do you mean by that? The values returned are different? Note that qExpectedImprovement is based on MC sampling, and so will only be identical if the underlying MCSamplers use the same base samples. If they do, then the acquisition function using the [1.0, 0.0] weights should be the same, so long as the model is the same. This goes to your second question - if you fit two models, one with one output and one with two outputs, then the fitting (which is non-deterministic) may result in slightly different models.
It is my understanding that FixedNoiseGP should treat the outputs as independent variables (even if they come from the same training data).
That is correct.
Another related question to this is... if I'm using qExpectedImprovement, it expects a best_f tensor. How should this be defined for a multi-objective problem? The dot product with the weights?
Yep, if you have a bunch of observed outcome pairs, then this would be the maximum over the dot products of the observation pairs.
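Concretely, assuming train_Y is an n x m tensor of the observed outcomes and weights is the m-dimensional scalarization vector used in the LinearMCObjective, this could look like:
# scalarize every observed outcome vector and take the best resulting value
best_f = (train_Y @ weights).max()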
the fitting process is stopped at different times if I work with the problem with 1 or 2 outputs. I'm not sure how it works, but the ftol condition should be different in the two cases?
There is randomness in the fitting process, and yes, the tolerance can also have an effect since now we're maximizing the sum of two MLLs. How different are the models? If the MLL is peaked (i.e. if there is a good amount of data with a reasonable amount of signal) then the difference in the models should be very small.
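As an aside (not something suggested above): if you want the outputs fitted completely independently rather than jointly as one batched multi-output model, one alternative is a ModelListGP of single-output GPs; a rough sketch, assuming train_X, train_Y1/train_Y2, and train_Yvar1/train_Yvar2 hold the data, and noting that older BoTorch versions may expect the sub-models to be passed as a list:
from botorch.models import FixedNoiseGP, ModelListGP
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls import SumMarginalLogLikelihood

# one single-output GP per outcome, each fitted with its own convergence criterion
model = ModelListGP(
    FixedNoiseGP(train_X, train_Y1, train_Yvar1),
    FixedNoiseGP(train_X, train_Y2, train_Yvar2),
)
mll = SumMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)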
What do you mean by that? The values returned are different? Note that qExpectedImprovement is based on MC sampling, and so will only be identical if the underlying MCSamplers use the same base samples. If they do, then the acquisition function using the [1.0, 0.0] weights should be the same, so long as the model is the same. This goes to your second question - if you fit two models, one with one output and one with two outputs, then the fitting (which is non-deterministic) may result in slightly different models.
The values and shape (which I guess is what matters to select the next points) are different. I attach an example. (Apologies for the lack of colorbars and labels, they are just quick screenshots.)
This is the mean of the posterior of the first output against the two variables (upper plot), followed by the evolution of the loss during the fitting process against iteration number (middle plot) and the acquisition function against the two variables and the q=9 next points to evaluate (bottom plot):
Now, this is the same kind of plot for the same problem but now removing the second output:
Because the weights for the first case are [1.0, 0.0], I was expecting the acquisition function to look the same as for the single-objective problem.
There is randomness in the fitting process, and yes, the tolerance can also have an effect since now we're maximizing the sum of two MLLs. How different are the models? If the MLL is peaked (i.e. if there is a good amount of data with a reasonable amount of signal) then the difference in the models should be very small.
I normalize both outputs to be between 0 and 1 before fitting and obtaining the acquisition function (although I de-normalize them in the upper plots of the examples above). They are not very different but the small discrepancies might be due, as you pointed out, to how the convergence criterion is met when you have the sum of two MLLs.
@Balandat, by using your suggestion of "maximum over the dot products of the observation pairs" to give a best_f to the acquisition function, I get them to be very close to each other. Not quite the same, but very similar. Thanks for your help!
Great, glad you got it working.
To reiterate, there is randomness in the acquisition function if it's MC-based (which it will generally be if q>1). By default we draw the base samples for the evaluation once and then keep them fixed (see Appendix B of https://arxiv.org/abs/1910.06403 for more details on this). The discrepancy you see could come either from the variance in drawing the base samples, or from slightly different hyperparameters of the fitted model. The latter you can check directly; the former you can evaluate by using a custom MCSampler - for the exact same model, you should see the acquisition functions converge (pointwise) in the number of MC samples.
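For example, a sketch of pinning the base samples with a fixed-seed sampler (sampler API as of the BoTorch versions around this thread; the sample count and seed are arbitrary), reusing the model, best_f, and objective from above, so that both acquisition functions see the same Monte Carlo draws:
from botorch.acquisition import qExpectedImprovement
from botorch.sampling.samplers import SobolQMCNormalSampler

sampler = SobolQMCNormalSampler(num_samples=1024, seed=0)  # fixed seed => identical base samples across acqf instances
acqf = qExpectedImprovement(model, best_f=best_f, sampler=sampler, objective=obj)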