pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

[Bug] The fit_gpytorch_mll() will crash after a period of normal fit. #2545

Open lixiangru123 opened 3 hours ago

lixiangru123 commented 3 hours ago

🐛 Bug

To reproduce

Code snippet to reproduce

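import os
import warnings

import gpytorch
import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.models.transforms.outcome import Standardize
from gpytorch.constraints import Interval
from gpytorch.kernels import MaternKernel, ScaleKernel
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood

warnings.filterwarnings("ignore")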

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.double
tkwargs = {"device": device, "dtype": dtype}

SMOKE_TEST = os.environ.get("SMOKE_TEST")

Area = torch.tensor(A_list, **tkwargs).unsqueeze(-1)
L40 = torch.tensor(L40, **tkwargs).unsqueeze(-1)
Q40 = torch.tensor(Q40, **tkwargs).unsqueeze(-1)
fSRF = torch.tensor(fSRF, **tkwargs).unsqueeze(-1)

# Area, L40, Q40 and fSRF are already tensors, so plain arithmetic keeps their device and dtype.
C1 = Area - 10000
C2 = abs(L40 - Lt) / Lt - 0.05
C3 = Qt - Q40
C4 = 80 - fSRF

def get_fitted_model(X, Y):
    likelihood = GaussianLikelihood(noise_constraint=Interval(1e-8, 1e-3))
    covar_module = ScaleKernel(  # Use the same lengthscale prior as in the TuRBO paper
        MaternKernel(nu=2.5, ard_num_dims=dim, lengthscale_constraint=Interval(0.005, 4.0))
    )

    model = SingleTaskGP(
        X,
        Y,
        covar_module=covar_module,
        likelihood=likelihood,
        outcome_transform=Standardize(m=1),
    )
    mll = ExactMarginalLogLikelihood(model.likelihood, model)

    with gpytorch.settings.max_cholesky_size(max_cholesky_size):
        fit_gpytorch_mll(mll)

    return model

model = get_fitted_model(train_X, Area)
c1_model = get_fitted_model(train_X, C1)
c2_model = get_fitted_model(train_X, C2)
c3_model = get_fitted_model(train_X, C3)
c4_model = get_fitted_model(train_X, C4)

Stack trace/error message

54) Best value: 1.23e+03, TR length: 1.25e-02
Traceback (most recent call last):
  File "/home/eda240601/SCBO_main.py", line 307, in <module>
    model = get_fitted_model(train_X, Area)
  File "/home/eda240601/SCBO_main.py", line 243, in get_fitted_model
    fit_gpytorch_mll(mll)
  File "/home/eda240601/.local/lib/python3.9/site-packages/botorch/fit.py", line 105, in fit_gpytorch_mll
    return FitGPyTorchMLL(
  File "/home/eda240601/.local/lib/python3.9/site-packages/botorch/utils/dispatcher.py", line 93, in __call__
    return func(*args, **kwargs)
  File "/home/eda240601/.local/lib/python3.9/site-packages/botorch/fit.py", line 283, in _fit_fallback
    raise ModelFittingError(msg)
botorch.exceptions.errors.ModelFittingError: All attempts to fit the model have failed. For more information, try enabling botorch.settings.debug mode.

Expected Behavior

System information

Additional context

My code fits a separate Gaussian process to the inputs for each of Area, C1, C2, C3 and C4. I have tried standardizing the constraints and the output and using the double dtype, but the code still crashes. I would appreciate any advice. Thank you very much!

saitcakmak commented 2 hours ago

Hi @lixiangru123. Model fitting errors are typically quite sensitive to the exact data used during model training. It'd be very difficult to reproduce the error without the X, the Y, and the random seed that were used while training the model.
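
One way to capture what a reproduction needs, sketched under assumptions: the helper below seeds PyTorch, runs a fitting callable, and on any exception saves X, Y and the seed before re-raising. `fit_or_dump`, `failing_fit` and the file name are hypothetical names used for illustration and are not part of BoTorch.

```python
import os
import tempfile
from typing import Callable

import torch


def fit_or_dump(fit_fn: Callable[[], object], X: torch.Tensor, Y: torch.Tensor,
                path: str, seed: int = 0) -> object:
    """Seed PyTorch, run fit_fn, and save X, Y and the seed if it raises.

    Hypothetical helper, not a BoTorch API.
    """
    torch.manual_seed(seed)
    try:
        return fit_fn()
    except Exception:
        # Keep exactly the data needed to replay the failing fit.
        torch.save({"X": X.detach().cpu(), "Y": Y.detach().cpu(), "seed": seed}, path)
        raise


def failing_fit() -> None:
    # Stand-in for a fit that fails; a real script would call fit_gpytorch_mll(mll).
    raise RuntimeError("simulated fitting failure")


X = torch.rand(5, 4, dtype=torch.double)
Y = torch.rand(5, 1, dtype=torch.double)
dump_path = os.path.join(tempfile.mkdtemp(), "failed_fit.pt")

try:
    fit_or_dump(failing_fit, X, Y, dump_path, seed=123)
    raised = False
except RuntimeError:
    raised = True

# The saved file can be attached to the issue and reloaded later.
saved = torch.load(dump_path)
```

In the script from this issue, the callable could be `lambda: fit_gpytorch_mll(mll)`, optionally run inside `with botorch.settings.debug(True):` as the error message suggests.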

These errors typically result from numerical precision issues. For example, having observations that are very close together in the input space can produce problematic gradient values, leading to the optimizer exiting with an error. Adding a little bit of noise to the recent observations, or modifying the input data in other ways, can help get around these issues.
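
A rough way to check for this before fitting, as a sketch only: the first helper reports the smallest distance between any two training inputs and the spread of the outcome, and the second adds a small amount of Gaussian noise to the inputs. Both function names and the jitter scale of 1e-6 are assumptions made for illustration; neither helper is part of BoTorch.

```python
import torch


def summarize_training_data(X: torch.Tensor, Y: torch.Tensor) -> dict:
    """Return the smallest pairwise input distance and the spread of Y.

    Hypothetical diagnostic, not a BoTorch API. X is n x d, Y is n x 1.
    """
    # Direct (non-matmul) distances, so tiny gaps are not lost to round-off.
    dists = torch.cdist(X, X, compute_mode="donot_use_mm_for_euclid_dist")
    dists.fill_diagonal_(float("inf"))  # ignore each point's distance to itself
    return {
        "min_pairwise_distance": dists.min().item(),
        "outcome_std": Y.std().item(),
    }


def jitter_inputs(X: torch.Tensor, scale: float = 1e-6, seed: int = 0) -> torch.Tensor:
    """Add small Gaussian noise to X; scale is illustrative, not a recommendation."""
    gen = torch.Generator().manual_seed(seed)
    noise = torch.randn(X.shape, generator=gen, dtype=X.dtype)
    return X + scale * noise.to(X.device)


# Two identical rows and a constant outcome, as a deliberately degenerate example.
X = torch.tensor([[0.1, 0.2], [0.1, 0.2], [0.9, 0.4]], dtype=torch.double)
Y = torch.full((3, 1), -20.0, dtype=torch.double)

before = summarize_training_data(X, Y)
after = summarize_training_data(jitter_inputs(X), Y)
```

A minimum distance of zero (or close to it) points at duplicated inputs, and an outcome standard deviation of zero points at an outcome that never changes across the training set.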

If I recall correctly, we've also seen similar issues caused by lengthscales getting close to the boundary of the Interval constraint. You could try passing transform=None to the constraint to see if that helps.

lixiangru123 commented 1 hour ago

Area 1450.9348167559276

C_NEXT tensor([[-8.5491e+03,  6.9746e-02, -7.1801e+00, -2.0000e+01]])
65) No feasible point yet! Smallest total violation: 3.81e-02, TR length: 1.00e-01

Area 1726.2291654813205
C_NEXT tensor([[-8.2738e+03,  1.5908e-01, -8.9915e+00, -2.0000e+01]])
66) No feasible point yet! Smallest total violation: 3.81e-02, TR length: 1.00e-01

Area 1711.2042492736352
C_NEXT tensor([[-8.2888e+03,  2.1350e-01, -7.7413e+00, -2.0000e+01]])
67) No feasible point yet! Smallest total violation: 3.81e-02, TR length: 5.00e-02

Area1329.9018383761756
C_NEXT tensor([[-8.6701e+03, -4.3031e-02, -7.6398e+00, -2.0000e+01]])
68) Best value: 1.33e+03, TR length: 5.00e-02
Area [[3.3295118826441463, 2.2468109189998353, 1, 23.725421047303826]]
69
Area 1393.5339231914709
C_NEXT tensor([[-8.6065e+03,  9.9990e-03, -7.5499e+00, -2.0000e+01]])
69) Best value: 1.33e+03, TR length: 5.00e-02

Area 1530.9920055199084
C_NEXT tensor([[-8.4690e+03,  5.5790e-02, -8.5064e+00, -2.0000e+01]])
70) Best value: 1.33e+03, TR length: 5.00e-02

Area 1534.6661586830724
C_NEXT tensor([[-8.4653e+03,  5.6825e-02, -8.5425e+00, -2.0000e+01]])
71) Best value: 1.33e+03, TR length: 5.00e-02

Area 1499.7136464473438
C_NEXT tensor([[-8.5003e+03,  3.9521e-02, -8.3981e+00, -2.0000e+01]])
72) Best value: 1.33e+03, TR length: 2.50e-02

Area 1430.9510075271141
C_NEXT tensor([[-8.5690e+03,  7.4139e-03, -8.0741e+00, -2.0000e+01]])
73) Best value: 1.33e+03, TR length: 2.50e-02

Area 1426.8588518820643
C_NEXT tensor([[-8.5731e+03,  6.6877e-03, -8.0375e+00, -2.0000e+01]])
74) Best value: 1.33e+03, TR length: 2.50e-02

Area 1430.7775522979998
C_NEXT tensor([[-8.5692e+03,  6.7545e-03, -8.0916e+00, -2.0000e+01]])
75) Best value: 1.33e+03, TR length: 2.50e-02

Area1426.2029884042142
C_NEXT tensor([[-8.5738e+03,  5.9937e-03, -8.0666e+00, -2.0000e+01]])
76) Best value: 1.33e+03, TR length: 1.25e-02

Area1379.7032829916407
C_NEXT tensor([[-8.6203e+03, -1.8542e-02, -7.8631e+00, -2.0000e+01]])
77) Best value: 1.33e+03, TR length: 1.25e-02

Traceback (most recent call last):
  File "/home/eda240601/SCBO_main.py", line 311, in <module>
    c4_model = get_fitted_model(train_X, C4)
  File "/home/eda240601/SCBO_main.py", line 243, in get_fitted_model
    fit_gpytorch_mll(mll)
  File "/home/eda240601/.local/lib/python3.9/site-packages/botorch/fit.py", line 105, in fit_gpytorch_mll
    return FitGPyTorchMLL(
  File "/home/eda240601/.local/lib/python3.9/site-packages/botorch/utils/dispatcher.py", line 93, in __call__
    return func(*args, **kwargs)
  File "/home/eda240601/.local/lib/python3.9/site-packages/botorch/fit.py", line 283, in _fit_fallback
    raise ModelFittingError(msg)
botorch.exceptions.errors.ModelFittingError: All attempts to fit the model have failed. For more information, try enabling botorch.settings.debug mode.

Thanks for your suggestion. The log above is from my latest crash. I think the constraint may have reached its boundary and lost its gradient, which leads to problems in the data set. However, because of the nature of the optimization problem itself, I do not know how to modify the values of this constraint so that they differ more from one another. In addition, I don't know how to modify the Interval or where to pass the transform=None you mentioned. My understanding is that it would look like the following code, which is my new modification:

def get_fitted_model(X, Y):
    likelihood = GaussianLikelihood(noise_constraint=Interval(1e-8, 1e-3),transform=None)
    covar_module = ScaleKernel(  # Use the same lengthscale prior as in the TuRBO paper
        MaternKernel(nu=2.5, ard_num_dims=dim, lengthscale_constraint=Interval(0.005, 10.0))  # previously Interval(0.005, 4.0)
    )

    model = SingleTaskGP(
        X,
        Y,
        covar_module=covar_module,
        likelihood=likelihood,
        outcome_transform=Standardize(m=1),
    )
    mll = ExactMarginalLogLikelihood(model.likelihood, model)

    with gpytorch.settings.max_cholesky_size(max_cholesky_size):
        fit_gpytorch_mll(mll)

    return model