pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

[Feature Request] T-batch acquisition optimization #2528

Closed: yunshengtian closed this 2 days ago

yunshengtian commented 6 days ago

šŸš€ Feature Request

Complete t-batch support (especially t-batch acquisition optimization) for parallel optimization across different experimental setups, if this is a good idea.

Motivation

The problem I am targeting is the parallel evaluation of BO algorithms. In BO research, we typically have to evaluate multiple BO algorithms on multiple problems with multiple random seeds, which becomes painful for large-scale evaluations (say, 100 problems and 100 seeds) if we naively parallelize on CPUs. So I wonder how well this could be done in a batched fashion on a GPU to speed up evaluations significantly.

Pitch

Describe the solution you'd like

In my understanding (correct me if I am wrong), t-batch support exists for fitting surrogate models and evaluating acquisition functions, but I am not sure whether optimizing acquisition functions in t-batch mode is well supported.

To do this, I assume we need to generate t-batched initial conditions (I saw a TODO here) and then use gen_candidates_torch to parallelize the optimization on a GPU. I also guess we need to keep the random-restart dimension separate from the t-batch dimension during optimization, so the input shape would probably be something like n_restart x batch_shape x q x d?
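
To illustrate the shape convention I have in mind, here is a rough sketch of evaluating an acquisition function over a t-batch of problems in one pass (the dimensions B, n, d, q, n_restart and the model/acquisition choices are just placeholders, and I may well be misusing the API):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.acquisition import qExpectedImprovement
from botorch.sampling import SobolQMCNormalSampler

# Hypothetical t-batch of B independent problems/seeds, each with its own data.
B, n, d, q, n_restart = 4, 10, 3, 2, 8
train_X = torch.rand(B, n, d, dtype=torch.double)
train_Y = torch.randn(B, n, 1, dtype=torch.double)

# SingleTaskGP accepts batched training data, yielding a model with batch_shape [B].
model = SingleTaskGP(train_X, train_Y)

acqf = qExpectedImprovement(
    model=model,
    best_f=train_Y.amax(dim=-2).squeeze(-1),  # per-batch incumbent, shape [B]
    sampler=SobolQMCNormalSampler(sample_shape=torch.Size([128])),
)

# Candidates with separate restart and t-batch dimensions: n_restart x B x q x d.
X = torch.rand(n_restart, B, q, d, dtype=torch.double)

# Evaluating in one pass should give one value per restart per t-batch: [n_restart, B].
print(acqf(X).shape)
```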

Describe alternatives you've considered

I currently can't come up with an alternative other than naive process-level parallelization instead of t-batching.

Are you willing to open a pull request? (See CONTRIBUTING)

In the near future I probably won't have the capacity for a pull request, since I'm not an expert in the codebase, and I'm not sure how reasonable or how challenging this would be. But I'm happy to discuss, understand what the best solution might be, and see if I can contribute anything. Thank you so much!

Balandat commented 5 days ago

Hmm that's an interesting idea. Technically I don't think there would be much in the way of doing something like this. The main challenge I see is that you also need an optimization algorithm that can handle this batched setup well.

In BoTorch we usually use the L-BFGS-B implementation in scipy for acquisition function optimization - the fact that we use Sample Average Approximation makes our acquisition functions (even the MC ones) deterministic, so we can benefit from the faster convergence of quasi-second-order optimization algorithms. L-BFGS-B implementations (incl. the one in scipy) don't support batching, so you can either optimize each batch of candidates individually (which would forgo much of the benefit of your proposal) or solve a single higher-dimensional optimization problem by stacking across batches. We do some of this in our optimization, but you have to be careful, as this makes the optimization problem higher-dimensional and harder to solve. So there is a tradeoff between benefiting from batched evaluation of the acquisition functions on the one hand, and longer optimization times and a higher likelihood of suboptimal solutions on the other.
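
To make the "stack across batches" option concrete, here is a rough sketch of the pattern (with a toy stand-in acquisition and made-up shapes B, q, d; BoTorch's own gen_candidates_scipy does this more carefully):

```python
import numpy as np
import torch
from scipy.optimize import minimize

B, q, d = 4, 2, 3  # hypothetical t-batch size and candidate shape

# Toy stand-in for a t-batched acquisition: maps (B, q, d) to one value per batch.
def batched_acqf(X: torch.Tensor) -> torch.Tensor:
    return -(X - 0.5).pow(2).sum(dim=(-1, -2))

def neg_joint_acqf(x_flat: np.ndarray):
    # Un-flatten into the batched candidate shape and evaluate the joint objective;
    # since the batches are independent, the gradient blocks of the sum are just
    # the per-batch gradients.
    x = torch.tensor(x_flat, dtype=torch.double, requires_grad=True)
    loss = -batched_acqf(x.view(B, q, d)).sum()
    loss.backward()
    return loss.item(), x.grad.numpy()

x0 = np.random.rand(B * q * d)
res = minimize(
    neg_joint_acqf,
    x0,
    jac=True,
    method="L-BFGS-B",
    bounds=[(0.0, 1.0)] * (B * q * d),  # box bounds on every flattened coordinate
)
X_opt = torch.from_numpy(res.x).view(B, q, d)  # one q x d candidate set per t-batch
```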

Similar considerations apply to first-order algorithms, though to a smaller extent - so if you want to use, say, Adam in torch to optimize the batched acquisition function, that could work. You'd still have to stack the dimensions, though, since AFAIK none of the algorithms in torch.optim are set up to solve batches of problems. I would be worried about situations in which the acquisition and gradient values are on different scales for the different batches in a way that confuses the optimizer, so you may end up optimizing some batches well and others poorly.
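
As a toy illustration of the "stack everything and hand it to a torch optimizer" route (again with a stand-in acquisition rather than a real BoTorch one, so just a sketch of the pattern):

```python
import torch

# Stand-in for a t-batched acquisition: maps X of shape (n_restart, B, q, d)
# to one value per restart per t-batch, shape (n_restart, B).
def batched_acqf(X: torch.Tensor) -> torch.Tensor:
    return -(X - 0.5).pow(2).sum(dim=(-1, -2))

n_restart, B, q, d = 8, 4, 2, 3
X = torch.rand(n_restart, B, q, d, requires_grad=True)
optimizer = torch.optim.Adam([X], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    # One scalar loss over all restarts and t-batches; the learning rate, number
    # of steps, and stopping rule are shared across batches, which is where
    # differing acquisition/gradient scales can lead to uneven progress.
    loss = -batched_acqf(X).sum()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        X.clamp_(0.0, 1.0)  # crude projection back into unit-cube bounds

# Pick the best restart separately for each t-batch.
best_vals, best_idx = batched_acqf(X).max(dim=0)
```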

I think if you're interested in the relative performance of different algorithms, this could still work - all algorithms optimized in that way would be subject to the same issues. But it would likely skew absolute performance, since the optimization quality would be inferior to how you would run it on an actual problem (in a non-batched fashion). This would likely result in systematically underestimating the performance of the algorithms run that way (so the gap to, say, a random-exploration baseline would look smaller than it actually is).

In the past we had actually been working on a batch-aware implementation of L-BFGS-B that we call "pLBFGS" (for "parallelized L-BFGS"), which would avoid this issue. That is itself a nontrivial research problem and we haven't been able to prioritize it. We'd love to get some external help on this :)

yunshengtian commented 2 days ago

Hi @Balandat thank you so much for such a detailed response! I appreciate your insights into this problem.

In my current implementation I tried using Adam with stacked dimensions, which empirically works worse than scipy's L-BFGS-B. This makes sense for the reasons you explained: objective/gradient scales can be annoying, and a single shared set of optimizer hyperparameters can be inflexible. But maybe as a first step, parallelizing only over the same problem with different approaches/seeds would be more useful and have fewer scale issues. In any case, it does seem like there is no off-the-shelf solution that works well, and research effort is needed. The pLBFGS effort sounds interesting - I apologize for not being able to help much right now, but I really look forward to any progress :)

I guess this issue becomes more important when we really want to scale up evaluation to a huge collection of benchmarks at an affordable cost, which IMHO is important for reliable and accessible BO research. But I certainly understand that your team has more pressing issues to focus on. Thanks again for your input!