Yeah so the issue is here: https://github.com/pytorch/botorch/blob/0357aa57822ebb3b5c37b6eaceda21b5dd548bb9/botorch/optim/optimize.py#L145-L149
Essentially, by wrapping the MFKG inside the FixedFeatureAcquisitionFunction, that instance check fails and we don't create initial conditions of the proper size.
I'd like to better understand what you're trying to achieve here. It looks like you're fixing the fidelity feature in your input. Are you essentially trying to find the best configuration while constraining your fidelity to a particular value?
As a workaround, you can circumvent this by passing in appropriate batch_initial_conditions as follows:

from botorch.optim.initializers import gen_one_shot_kg_initial_conditions

batch_initial_conditions = gen_one_shot_kg_initial_conditions(
    acq_function=qKG,
    bounds=bounds,
    q=1,
    num_restarts=5,
    raw_samples=25,
)
# throw away the fidelity dimension
ff_batch_initial_conditions = batch_initial_conditions[..., :-1]

with manual_seed(1234):
    candidates, acq_value = optimize_acqf(
        acq_function=qKG_FF,
        bounds=ff_bounds,
        q=1,
        num_restarts=5,
        raw_samples=25,
        batch_initial_conditions=ff_batch_initial_conditions,  # make sure to pass this
    )
cc @saitcakmak
I believe this is the same issue as here: https://github.com/pytorch/botorch/issues/578#issuecomment-718344297 (and the following comment for the workaround: passing q=num_fantasies + actual_q to optimize_acqf, where actual_q is the one you intended to use).
Like @Balandat points out, this is due to wrapping a OneShotAcquisitionFunction within FixedFeatureAcquisitionFunction. I think passing the initial conditions explicitly is the best option here.
@Balandat What would you think of a fix like this: (1) define (with a better name) ffos = isinstance(acq_function, FixedFeatureAcquisitionFunction) and isinstance(acq_function.acq_function, qKnowledgeGradient); (2) replace if isinstance(acq_function, qKnowledgeGradient) with if isinstance(acq_function, qKnowledgeGradient) or ffos; and (3) in batch_initial_conditions = ic_gen(acq_function=acq_function, ...), pass acq_function=acq_function.acq_function if ffos else acq_function instead (a rough sketch is below).
We can also add a post processing step to remove the feature dimensions from the initial conditions.
I guess a cleaner way to implement this is to define a gen_fixed_feature_initial_conditions that performs these steps.
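For concreteness, a rough sketch of such a gen_fixed_feature_initial_conditions along these lines (illustrative only, not actual BoTorch code; the explicit fixed_columns argument is a placeholder for whatever the wrapper exposes):

from botorch.acquisition import FixedFeatureAcquisitionFunction, qKnowledgeGradient
from botorch.optim.initializers import (
    gen_batch_initial_conditions,
    gen_one_shot_kg_initial_conditions,
)

def gen_fixed_feature_initial_conditions(
    acq_function, bounds, q, num_restarts, raw_samples, fixed_columns=()
):
    # "ffos": a FixedFeatureAcquisitionFunction wrapping a one-shot qKnowledgeGradient;
    # in that case `bounds` should be the full-dimensional bounds
    ffos = isinstance(acq_function, FixedFeatureAcquisitionFunction) and isinstance(
        acq_function.acq_function, qKnowledgeGradient
    )
    ic_gen = (
        gen_one_shot_kg_initial_conditions
        if isinstance(acq_function, qKnowledgeGradient) or ffos
        else gen_batch_initial_conditions
    )
    ics = ic_gen(
        acq_function=acq_function.acq_function if ffos else acq_function,
        bounds=bounds,
        q=q,
        num_restarts=num_restarts,
        raw_samples=raw_samples,
    )
    if ffos:
        # post-processing: drop the fixed-feature columns from the generated ICs
        keep = [i for i in range(ics.shape[-1]) if i not in set(fixed_columns)]
        ics = ics[..., keep]
    return ics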
We can also add a post processing step to remove the feature dimensions from the initial conditions.
One issue with this is that if we generate a low discrepancy Sobol sequence in higher dimensions and then chop off a dimension then that destroys the discrepancy properties.
I guess a cleaner way to implement this is to define a gen_fixed_feature_initial_conditions that performs these steps
Yeah that makes sense. We should probably have a helper that just takes the original bounds and the FixedFeatureAcquisitionFunction and then generates ICs under the hood using the information in the _selector of the FixedFeatureAcquisitionFunction.
if we generate a low discrepancy Sobol sequence in higher dimensions and then chop off a dimension then that destroys the discrepancy properties.
This makes sense when we increase the dimension (e.g., #504), but it is counterintuitive when we reduce the dimension. I remember from the discussion in https://github.com/scipy/scipy/pull/10844 that there are some quirks to using QMC. I took a second look now but can't find anything about chopping off dimensions there. I dug a bit deeper and it seems that chopping off dimensions is safe, see below for discussion and reference.
TLDR: Chopping off dimensions preserves the digital net property.
Going with the notation in https://arxiv.org/pdf/2008.08051.pdf, Sobol is a (t, d)-sequence, i.e., the first 2^m points of Sobol generate a (t, m, d)-net in base 2. This is saying that every elementary interval of volume 2^{t-m} contains 2^t points.
An elementary interval in base 2 is given by E[k, c] = \prod_{j=1}^{d} [ c_j / 2^{k_j}, (c_j + 1) / 2^{k_j} ), for non-negative integers c_j, k_j. If I chop off a dimension (wlog, say the last dimension), I know that k_d <= m - t. If k_d = 1, the (d-1)-dim elementary interval E[k, c] = \prod_{j=1}^{d-1} [ c_j / 2^{k_j}, (c_j + 1) / 2^{k_j} ) has volume 2^{t-m} and contains 2^t points. If k_d = 2, then the (d-1)-dim interval contains 2^t points but has volume 2^{t-m+1}. I could look at a 2 x 2 grid of the original d-dim intervals to show that the sub-intervals of these (d-1)-dim intervals contain 2^t points and have volume 2^{t-m}. If this carries on for all k_d <= m - t, this would then tell me that chopping off a dimension of a (t, m, d)-net in base 2 results in a (t, m, d-1)-net in base 2. Does this then say that chopping off dimensions is safe?
I also found that this is given by lemma 4.16 of https://web.maths.unsw.edu.au/~josefdick/preprints/DP_book_preprint.pdf.
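As a tiny concrete check of this (my own illustration, not taken from the lemma): the first 2^2 points of the unscrambled 2-dim Sobol sequence form a (0, 2, 2)-net, and chopping off the second coordinate still leaves a (0, 2, 1)-net:

# first 2^2 points of the unscrambled 2-dim Sobol sequence: a (0, 2, 2)-net in base 2
pts = [(0.0, 0.0), (0.5, 0.5), (0.25, 0.75), (0.75, 0.25)]
# chop off the second coordinate and check the (0, 2, 1)-net property: every elementary
# interval [c/4, (c+1)/4) of volume 2^{0-2} must contain exactly 2^0 = 1 point
proj = [x for x, _ in pts]
counts = [sum(c / 4 <= x < (c + 1) / 4 for x in proj) for c in range(4)]
assert counts == [1, 1, 1, 1]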
Yeah, that makes sense. I would think that this should easily extend to scrambled sequences as well.
Regardless, I think I would still prefer to generate the initial conditions for the non-fixed dimensions and then add on the fixed dimensions afterwards (as described above), rather than working in the higher-dim space and then fixing ICs.
I'd like to better understand what you're trying to achieve here. It looks like you're fixing the fidelity feature in your input. Are you essentially trying to find the best configuration while constraining your fidelity to a particular value?
@Balandat Yes, that's exactly what I intend to do. Thanks for the workaround, it does solve the error. After doing some tests I observed that using the FixedFeatureAcquisitionFunction to constrain one or more features, using fixed_features within optimize_acqf to do the same thing, and using the vanilla optimize_acqf with the bounds appropriately modified (identical lower and upper bounds for the fixed features) all give the same result (unsurprisingly to me), although the last approach seems to be the most efficient.
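For reference, the latter two variants look like this (condensed from the timing script further down in this thread; qMFKG, t_bounds, p, and train_x_n are defined there):

# (a) fix the fidelity column via the fixed_features argument of optimize_acqf
candidates, acq_value = optimize_acqf(
    acq_function=qMFKG,
    bounds=t_bounds,
    q=1,
    num_restarts=20,
    raw_samples=250,
    fixed_features={p: train_x_n[-1, -1].item()},
)

# (b) or pin the feature through the bounds themselves: identical lower and upper bounds
t_bounds[:, -1] = train_x_n[-1, -1]
candidates, acq_value = optimize_acqf(
    acq_function=qMFKG,
    bounds=t_bounds,
    q=1,
    num_restarts=20,
    raw_samples=250,
)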
I believe this issue is the same issue here: #578 (comment)
@saitcakmak Yes it is the same issue, somehow that got lost in the weeds.
Overall, I do have a question about the "one shot" approach: it seems as though the KG acquisition function is continuous but not necessarily smooth (see, e.g., https://people.orie.cornell.edu/pfrazier/pub/CorrelatedKG.pdf). Therefore, when you solve the one-shot acquisition function via gradient-based local optimization methods such as L-BFGS, you can only guarantee a critical point but not necessarily an optimum. Is this a correct assumption, or does the one-shot approach make an assumption such as L-smoothness? On a related note, since the number of variables in the one-shot optimization is q + num_fantasies, do we need to pick raw_samples to be > q + num_fantasies during optimization? Intuitively, I don't want to pick 100 raw samples when the dimensionality of my optimization problem is ~1000.
although the last approach seems to be the most efficient.
This is interesting. Using the full function and passing degenerate bounds is more efficient in what sense? Faster to compute? Or less complicated to set up?
Overall, I do have a question about the "one shot" approach: it seems as though the KG acquisition function is continuous but not necessarily smooth (see, e.g., https://people.orie.cornell.edu/pfrazier/pub/CorrelatedKG.pdf)
What part of the paper are you referring to specifically? I guess there are two things to note here:
since the number of variables in the one-shot optimization is q + num_fantasies do we need to pick raw_samples to be > q + num_fantasies during optimization? Intuitively, I don't want to pick 100 raw samples when the dimensionality of my optimization problem is ~1000.
Yep that's true. The problem is higher-dimensional so ideally you want to crank up raw_samples as much as you can. Eventually you'll run into memory / compute limitations, so we don't scale this automatically with the number of fantasies. Note that we're being somewhat smarter when generating ICs for qKG by using a heuristic that utilizes the maximizer of the posterior mean, which in practice is often close to the optimal value of the fantasy variables: https://github.com/pytorch/botorch/blob/439c9ef13c0462468239768a08055d1178a9b8ab/botorch/optim/initializers.py#L145-L156
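Roughly, the idea behind that heuristic is the following (a paraphrased sketch, not the actual implementation; see the linked initializers.py for the real logic):

from botorch.acquisition import PosteriorMean
from botorch.optim import optimize_acqf
from botorch.optim.initializers import gen_batch_initial_conditions

def sketch_one_shot_kg_ics(qkg, bounds, q, num_restarts, raw_samples, frac_random=0.1):
    # 1) standard ICs for the augmented (q + num_fantasies)-point problem
    q_aug = qkg.get_augmented_q_batch_size(q=q)
    ics = gen_batch_initial_conditions(
        acq_function=qkg, bounds=bounds, q=q_aug,
        num_restarts=num_restarts, raw_samples=raw_samples,
    )
    # 2) maximizer of the current posterior mean -- in practice often close to the
    #    optimal location of the fantasy variables
    pm_max, _ = optimize_acqf(
        acq_function=PosteriorMean(qkg.model), bounds=bounds, q=1,
        num_restarts=num_restarts, raw_samples=raw_samples,
    )
    # 3) overwrite most of the fantasy part of the ICs with that maximizer
    n_value = int((1 - frac_random) * (q_aug - q))
    ics[..., -n_value:, :] = pm_max
    return ics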
This is interesting. Using the full function and passing degenerate bounds is more efficient in what sense? Faster to compute? Or less complicated to set up?
I had meant wall-clock times. The speedup is ~5x
What part of the paper are you referring to specifically?
See Fig.1 -- in this case, the maximum of the KG is in a locally smooth region, so there's no issue, but overall you can see that it is not continuously differentiable. I will try to add an example where the max has to be determined by solving a non-smooth optimization problem.
I will read your paper for the regularity assumptions.
Yep that's true. The problem is higher-dimensional so ideally you want to crank up raw_samples as much as you can. Eventually you'll run into memory / compute limitations, so we don't scale this automatically with the number of fantasies.
Yes, right now I am using ~2000 fantasies and I am not able to increase the raw_samples beyond say 500 without facing a memory failure. Are there general rules of thumb to pick num_fantasies for a given problem of dimensions d? I guess it is highly problem dependent.
I had meant wall-clock times. The speedup is ~5x
Interesting. Do you have some simple code sitting around that compares this? I'd like to profile this to figure out what's going on. It might be that we are generating infeasible ICs for the optimization problem with fixed features right now, which means the optimizer has to work unnecessarily hard (cc @sdaulton)
See Fig.1
From the figure it appears to me that they're considering the noiseless case. In that situation the posterior mean is not differentiable at the observed data points, which looks like what's happening in the figure.
Right now I am using ~2000 fantasies and I am not able to increase the raw_samples beyond say 500 without facing a memory failure. Are there general rules of thumb to pick num_fantasies for a given problem of dimensions d? I guess it is highly problem dependent.
2000 fantasies is a lot. We've gotten decent results with far fewer, say 16-64. As you crank up the number of fantasies you're going to increase the size of your optimization problem a lot, which makes it way harder to solve. I would try to start from a small number of fantasies (in which case you can crank up raw_samples), and then see what the marginal benefit of increasing that number is.
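(As a minimal sketch of that suggestion, reusing names from the repro code posted below -- fewer fantasies leave budget for a larger raw_samples:)

qMFKG_16 = qMultiFidelityKnowledgeGradient(gp, num_fantasies=16, project=project)
candidates, acq_value = optimize_acqf(
    acq_function=qMFKG_16,
    bounds=t_bounds,
    q=1,
    num_restarts=20,
    raw_samples=1024,  # affordable with 16 fantasies; then compare against 32/64
)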
Do you have some simple code sitting around that compares this?
Sure - see below. Actually, the speed-up changes when I set num_fantasies to be small, like 64. Also, the reported times are highly-variable - suggesting that I might have a lot going on on my workstation while I timed this code. However, I see that the results change drastically with the num_fantasies, which I varied from 16-64 and then tried with 1000. Can you point me to an example where you got good results with 16-64? Also, unless I set a common seed, the answer between successive repetitions of the optimize_acqf is different. I would expect this with a highly nonconvex objective function, however I am wondering if there are flat regions where the optimizer is getting stuck. I did print the gradf at the end of optimization in gen_candidates_scipy to check that the gradient is close to zero at the conclusion of the optimization, though.
I have tried plotting the acquisition function via evaluate() and compared its maximizer with the one-shot maximizer, and have found different results. I will post an example of that sometime next week.
import math
import torch
from matplotlib import cm
from matplotlib import pyplot as plt
import numpy as np
import botorch
from botorch.fit import fit_gpytorch_model
from botorch.models import SingleTaskGP
from botorch.utils import standardize
from gpytorch.mlls import ExactMarginalLogLikelihood
import gpytorch # main GP library
from botorch.models.gpytorch import GPyTorchModel
from time import time
def griewank_t(x):
    t1 = time()
    n, p_ = x.shape
    p = p_ - 1
    if p_ != 3:
        raise ValueError('error! inputs must be [...x 3]. ')
    y = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        theta = torch.tensor(x[i][-1] * math.pi / 4)
        xr = rotate_x(theta, x[i][:-1].reshape(-1, 2)).numpy()
        y[i] = np.sum(np.power(xr, 2) / 4000.0, axis=1) - np.prod(np.divide(np.cos(xr), np.sqrt(np.arange(p) + 1)), axis=1) + 1
        y[i] = y[i] * np.exp(-(xr[0, 0] - 3)**2 / 8 / (ub - lb) - xr[0, 1]**2 / 8 / (ub - lb))
    print(f"elapsed time for oracle eval {time() - t1}")
    return 1 * torch.tensor(y)

def rotate_x(theta, test_x):
    R = torch.tensor([[torch.cos(theta), -torch.sin(theta)],
                      [torch.sin(theta), torch.cos(theta)]])
    return torch.matmul(test_x, R)
import numpy as np
from botorch.utils.transforms import normalize, standardize, unnormalize
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
training_size = 60
p = 2
lb = -5.0
ub = 5.0
tn = 1.
T = 4.0
# training data
t = torch.linspace(0, tn, training_size).reshape(-1, 1)
train_x = lb + (ub - lb) * torch.rand([training_size, p])
train_x = torch.cat((train_x, t), dim=-1)
train_y = griewank_t(train_x)
# train_y = torch.tensor(train_y, dtype=torch.float32)
# normalize and standardize
actual_bounds = torch.stack([torch.cat([torch.ones(p) * lb, torch.tensor([0.])]),
                             torch.cat([torch.ones(p) * ub, torch.tensor([T])])])
train_x_n = normalize(train_x, bounds=actual_bounds)
train_y_n = train_y  # standardize(train_y)
gp = SingleTaskGP(train_x_n, train_y_n.reshape(-1,1))
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_model(mll);
from botorch.acquisition import qKnowledgeGradient, qMultiFidelityKnowledgeGradient
from botorch.acquisition import UpperConfidenceBound, PosteriorMean, ExpectedImprovement
# project fidelity to T
T = 1.0
def project(x):
    return torch.cat([x[..., :-1], torch.full_like(x[..., -1:], T)], dim=-1)
# MFKG
qMFKG = qMultiFidelityKnowledgeGradient(gp, num_fantasies=32, project=project)
# UCB
ucb = UpperConfidenceBound(gp, beta=2.)
# Posterior mean
pm = PosteriorMean(gp)
from botorch.acquisition import FixedFeatureAcquisitionFunction
T = 1.0
columns = [p]
values = [train_x_n[-1, -1]]
qKG_FF = FixedFeatureAcquisitionFunction(qMFKG, p+1, columns, values)
from botorch.optim.initializers import gen_one_shot_kg_initial_conditions
from botorch.optim import optimize_acqf
from botorch.utils.sampling import manual_seed
from time import time
bounds = torch.cat( [torch.zeros(p+1).reshape(1,-1), torch.ones(p+1).reshape(1,-1)], dim=0)
ff_bounds = bounds[...,:-1].reshape(2,-1)
t_bounds = bounds.clone()  # clone so the original bounds tensor is not modified in place
t_bounds[..., -1] = train_x_n[-1, -1]
with manual_seed(1235):
    batch_initial_conditions = gen_one_shot_kg_initial_conditions(
        acq_function=qMFKG,
        bounds=t_bounds,
        q=1,
        num_restarts=20,
        raw_samples=250,
    )
ff_batch_initial_conditions = batch_initial_conditions[..., :-1]
t1 = time()
with manual_seed(1235):
    candidates, acq_value = optimize_acqf(
        acq_function=qKG_FF,
        bounds=ff_bounds,
        q=1,
        num_restarts=20,
        raw_samples=250,
        batch_initial_conditions=ff_batch_initial_conditions,
    )
print(f'elapsed time {time()-t1}')
print(f'candidate {candidates}')
t1 = time()
with manual_seed(1235):
    candidates, acq_value = optimize_acqf(
        acq_function=qMFKG,
        bounds=t_bounds,
        q=1,
        num_restarts=20,
        raw_samples=250,
        fixed_features={p: train_x_n[-1, -1].item()},
    )
print(f'elapsed time {time()-t1}')
print(f'candidate {candidates}')
t1 = time()
with manual_seed(1235):
    candidates, acq_value = optimize_acqf(
        acq_function=qMFKG,
        bounds=t_bounds,
        q=1,
        num_restarts=20,
        raw_samples=250,
        # fixed_features={p: train_x_n[-1, -1].item()}
    )
print(f'elapsed time {time()-t1}')
print(f'candidate {candidates}')
Can you point me to an example where you got good results with 16-64?
If I recall correctly, most of the lower-dim synthetic test functions did work quite well with 32-64 fantasy samples.
the reported times are highly-variable - suggesting that I might have a lot going on on my workstation while I timed this code
Not necessarily. Since we're randomizing the initial conditions, we may get lucky or not with our starting points for LBFGS-B, which can have a significant effect on the optimization result and necessary number of iterations. High variance in the optimization time is not surprising.
unless I set a common seed, the answer between successive repetitions of the optimize_acqf is different.
I ran your example; the results appear pretty consistent (small numerical differences): repro_botorch_issue_626.ipynb.txt
I am wondering if there are flat regions where the optimizer is getting stuck.
This can indeed happen. Depending on the model, acquisition functions such as KG or EI are notorious for being flat (zero, actually) across much of the search space in some situations. We do our best to exploit fast batch evaluations to splatter a large number of initial conditions on the search space and run a Boltzmann-sampling heuristic on those in order to minimize the risk of getting stuck in the flat parts, but at the end of the day there is a lot of room in R^n for large n...
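(For reference, that splatter-and-select mechanism looks roughly like the following -- a simplified sketch of what gen_batch_initial_conditions does internally via initialize_q_batch, not the exact BoTorch code; it reuses the gp, t_bounds, and train_y_n objects from the repro code above:)

import torch
from botorch.acquisition import ExpectedImprovement
from botorch.optim.initializers import initialize_q_batch
from botorch.utils.sampling import draw_sobol_samples

ei = ExpectedImprovement(gp, best_f=train_y_n.max())
# splatter raw_samples candidate starts over the box and evaluate the acquisition in batch
X_raw = draw_sobol_samples(bounds=t_bounds, n=250, q=1)
with torch.no_grad():
    Y_raw = ei(X_raw)
# Boltzmann-sample num_restarts promising starting points from the raw candidates
X_init = initialize_q_batch(X=X_raw, Y=Y_raw, n=20, eta=2.0)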
Can you point me to an example where you got good results with 16-64?
You could also check out this google sheet (https://docs.google.com/spreadsheets/d/1DWQxD-qr-7SQVLT64da3qs0FdsiF1UblFALAMIqIRmI/edit?usp=sharing) which reports experiment results for the one-shot and some other approaches for optimizing KG. There are a number of test functions studied there, where each tab is for a given test function and a given sampling budget. If a function name has a number attached to it, that is the function dimension; otherwise it is the default dimension given in BoTorch. For your question, the lines with OSKG and OSKG num_fant-16 would be of interest, which report one-shot KG performance using 64 and 16 fantasies, respectively. Some tabs also have nested-KG reported, which is just KG optimized in a nested manner (64 fantasies). The takeaway is that for functions with <10 dims, 16 vs 64 fantasies does not make too much difference, and one may beat the other based on the particular problem structure or the computational budget assigned to it.
FWIW, in (https://arxiv.org/abs/2007.05554) we used 10 fantasies across the board (mainly for computational reasons) on 4-7 dim functions with good success.
@Balandat @saitcakmak
I tested KG in v 0.3.3 and I can confirm that the one-shot maximum and the KG.evaluate() maximum are consistent, provided there are enough restarts. See below for an example of how the restarts matter, with num_fantasies = 32; the notebook is also attached. I guess the most important realization for me was that num_fantasies does not need to be as high as I thought it should be -- I was thinking along the lines of the number of samples required for an MC approximation.
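(The attached notebook is not reproduced here. As a rough illustration of the kind of consistency check described -- assuming the qMFKG, t_bounds, p, and train_x_n objects from the repro code above, and that qKnowledgeGradient.evaluate accepts the inner-optimization options as keyword arguments -- something like:)

# one-shot maximizer
candidates, acq_value = optimize_acqf(
    acq_function=qMFKG,
    bounds=t_bounds,
    q=1,
    num_restarts=50,  # more restarts -> more consistent maximizer
    raw_samples=512,
)
# evaluate() solves the inner problem explicitly, so its argmax over a (coarse) grid
# gives an independent estimate of where the KG maximum lies
grid = torch.rand(128, 1, p + 1)
grid[..., -1] = train_x_n[-1, -1]  # pin the fidelity feature
kg_vals = qMFKG.evaluate(grid, bounds=t_bounds, num_restarts=5, raw_samples=64)
print(grid[kg_vals.argmax()], candidates)  # should roughly agree given enough restarts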
Overall, I do see that for observation noise below ~1e-3 the acquisition function does look non-smooth. I should use caution here because I am approximating the acquisition function on a rather coarse grid. However, I checked that the stopping criterion for L-BFGS in SciPy is not necessarily first-order optimality but can also be the tolerance on function values; I separately verified this by optimizing other non-smooth functions.
Thanks for all your response - much appreciated!
Great. I assume this can be considered resolved then, feel free to reopen if not.
Related to #625 but opening a new issue since that one was closed.
Versions
torch/gpytorch/botorch: 1.7.0-cpu/1.3.0/0.3.3
Reproducible example