pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

Add ScaleKernel to get_covar_module_with_dim_scaled_prior #2619

Open dai08srhg opened 2 weeks ago

dai08srhg commented 2 weeks ago

Motivation

Since version 0.12.0, the dimension-scaled lognormal lengthscale prior [Hvarfner2024vanilla] has been the default. However, because get_covar_module_with_dim_scaled_prior() does not wrap the base kernel in a ScaleKernel, performance may deteriorate in some cases. (The best choice of prior depends on the task, but learning an outputscale seems beneficial, or at least harmless, across tasks.)
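For concreteness, here is a minimal sketch of what the change amounts to in user code. It emulates the proposal via the public BoTorch/GPyTorch APIs rather than showing this PR's actual diff:

```python
# Minimal sketch: emulate the proposed behavior by wrapping the base
# kernel returned by get_covar_module_with_dim_scaled_prior in a
# ScaleKernel before handing it to SingleTaskGP.
import torch
from botorch.models import SingleTaskGP
from botorch.models.utils.gpytorch_modules import (
    get_covar_module_with_dim_scaled_prior,
)
from gpytorch.kernels import ScaleKernel

D = 40
train_X = torch.rand(20, D, dtype=torch.double)
train_Y = torch.randn(20, 1, dtype=torch.double)

# Current default: dim-scaled lognormal lengthscale prior, no outputscale.
base_kernel = get_covar_module_with_dim_scaled_prior(ard_num_dims=D)

# Proposed: add a learnable outputscale on top of the base kernel.
model = SingleTaskGP(train_X, train_Y, covar_module=ScaleKernel(base_kernel))
```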

An example is shown below: the task of finding the minimum of Styblinski-Tang (D=40) is run three times, and the average performance is compared. (Attached figures: StyblinskiTang40_performance, StyblinskiTang40_all.)

Have you read the Contributing Guidelines on pull requests?

I have read it.

Test Plan

Since this is a performance-related change, testing will involve a performance comparison on benchmark functions (such as Styblinski-Tang).
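A hedged sketch of the kind of comparison this would involve (the loop details here are illustrative, not the exact script used for the plots above):

```python
# Illustrative BO loop on Styblinski-Tang (D=40); run it once with the
# default covar_module and once with a ScaleKernel-wrapped one, then
# compare the best observed values across repetitions.
import torch
from botorch.acquisition import LogExpectedImprovement
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.optim import optimize_acqf
from botorch.test_functions import StyblinskiTang
from gpytorch.mlls import ExactMarginalLogLikelihood

f = StyblinskiTang(dim=40, negate=True)  # negate: BoTorch maximizes
bounds = f.bounds.to(torch.double)

train_X = bounds[0] + (bounds[1] - bounds[0]) * torch.rand(10, 40, dtype=torch.double)
train_Y = f(train_X).unsqueeze(-1)

for _ in range(50):
    # Pass covar_module=ScaleKernel(...) here for the comparison arm.
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    acqf = LogExpectedImprovement(model, best_f=train_Y.max())
    candidate, _ = optimize_acqf(
        acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=256
    )
    train_X = torch.cat([train_X, candidate])
    train_Y = torch.cat([train_Y, f(candidate).unsqueeze(-1)])

print(f"best value found: {train_Y.max().item():.3f}")
```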

Related PRs

(If this PR adds or changes functionality, please take some time to update the docs at https://github.com/pytorch/botorch, and link to your PR here.)

Balandat commented 2 weeks ago

@hvarfner you have done some evaluations here - any thoughts? My understanding was that we didn't really see any benefit from using a scale kernel across a variety of functions if we standardize the outcomes.
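For context on the standardization point: in recent BoTorch versions, SingleTaskGP applies a Standardize outcome transform by default, which already fixes the overall scale of the outcomes much as a learned outputscale would. A minimal sketch making that default explicit:

```python
# Outcome standardization referenced above: recent BoTorch versions apply
# a Standardize outcome transform in SingleTaskGP by default.
import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms.outcome import Standardize

train_X = torch.rand(20, 40, dtype=torch.double)
train_Y = torch.randn(20, 1, dtype=torch.double)

# Equivalent to the current default behavior, spelled out:
model = SingleTaskGP(train_X, train_Y, outcome_transform=Standardize(m=1))
```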

hvarfner commented 1 week ago

Hi @dai08srhg ,

Thanks for checking this out, and sorry for the late reply! When the new priors were implemented, the ScaleKernel was dropped after extensive ablation. In fact, there are cases where the outputscale is actively problematic.

The results in the paper (e.g., Figs. 19-22) show that performance was frequently substantially worse with a ScaleKernel. For high-dimensional problems, the outputscale parameter tends to shrink quite rapidly, leading to very local behavior, from which worse performance may follow. In some internal testing on mid- and high-dimensional problems, the performance with a ScaleKernel included was generally not better, either.
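The shrinkage is easy to observe directly by inspecting the fitted outputscale (a minimal sketch; random data stands in for a real high-dimensional problem here):

```python
# Minimal sketch: fit a GP with an explicit ScaleKernel on 40-dimensional
# data and inspect the learned outputscale. A very small outputscale makes
# the posterior revert to the constant mean away from the data, i.e. the
# "local behavior" described above.
import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.models.utils.gpytorch_modules import (
    get_covar_module_with_dim_scaled_prior,
)
from gpytorch.kernels import ScaleKernel
from gpytorch.mlls import ExactMarginalLogLikelihood

D = 40
train_X = torch.rand(30, D, dtype=torch.double)
train_Y = torch.randn(30, 1, dtype=torch.double)

covar = ScaleKernel(get_covar_module_with_dim_scaled_prior(ard_num_dims=D))
model = SingleTaskGP(train_X, train_Y, covar_module=covar)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

print(model.covar_module.outputscale)  # values << 1 indicate shrinkage
```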

Now, the shrinkage does not always happen, and it is not always bad for performance. I re-ran your specific experiment and also noticed that including the ScaleKernel was slightly better --> stybtang40.pdf, but the difference I saw (over 10 runs) was not as stark.

With that said, I think the effect of the outputscale (whether it is learned or not, when it shrinks, and whether that shrinkage is good or bad) is very interesting, and I would like to understand it better. However, we concluded that the inclusion of a ScaleKernel does more harm than good - both in terms of regret and the exploration-exploitation trade-off.