pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License

[Bug] Inconsistent behavior: `standardize` vs. `Standardize` with n < 2 #2394

Open AdrianSosic opened 2 weeks ago

AdrianSosic commented 2 weeks ago

πŸ› Bug

I am currently playing around with situations where there's no training data available yet and noticed that the behaviors of utils.standardize and transforms.Standardize are inconsistent.

To reproduce

Here is a minimal example adapted from the code on the landing page. When you run the original code for the GP creation (in the comments), everything works fine. However, when you run the displayed version, you get the error shown below.

Code snippet to reproduce

import torch
from botorch.acquisition import UpperConfidenceBound
from botorch.models import SingleTaskGP
from botorch.models.transforms import Standardize
from botorch.optim import optimize_acqf
from botorch.utils import standardize

# >>>>> changed code
# Unlike in the main page example, we set the number of training points to 0
train_X = torch.rand(0, 2)
# <<<<< changed code

Y = 1 - torch.linalg.norm(train_X - 0.5, dim=-1, keepdim=True)
Y = Y + 0.1 * torch.randn_like(Y)

# >>>>> original version
# train_Y = standardize(Y)
# gp = SingleTaskGP(train_X, train_Y)
# -----
train_Y = Y
gp = SingleTaskGP(train_X, train_Y, outcome_transform=Standardize(1))
# <<<<< version throwing error

# >>>>> changed code
# Because there is no training data, we do not fit the model parameters
gp.eval()
# <<<<< changed code

UCB = UpperConfidenceBound(gp, beta=0.1)
bounds = torch.stack([torch.zeros(2), torch.ones(2)])
candidate, acq_value = optimize_acqf(
    UCB,
    bounds=bounds,
    q=1,
    num_restarts=5,
    raw_samples=20,
)

Stack trace/error message

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Expected Behavior

In both cases, the GP posterior should simply match the GP prior, and the standardization applied along the way should not mess with the computation. I haven't checked what exact logic is applied internally when there is no data / only one data point passed to the transformation, but my intuition tells me that what should happen is:


esantorella commented 2 weeks ago

This isn't supported as of BoTorch 0.11.1; Standardize now raises an exception when given empty data. standardize isn't recommended; it's generally better to use an outcome transform than to standardize the data ahead of time, because the predictions will then be on the original scale. So I don't think there's a good way to do this with both standardization and empty data.

We should update the documentation to avoid references to standardize, and either deprecate it or clarify its behavior with empty data.

As for behavior with empty data, the GP will give the same posterior distribution at any point, so an acquisition function like UCB will have the same value at every point, and optimize_acqf won't be able to make any progress. It will return something random.

To avoid relying on a model whose fit is likely poor, and to ensure diversity, it's common to generate the first five or so candidates using Sobol quasi-random points, e.g. with draw_sobol_samples.
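For reference, a minimal sketch of that Sobol-based initialization, reusing the bounds from the snippet above (variable names are arbitrary):

import torch
from botorch.utils.sampling import draw_sobol_samples

bounds = torch.stack([torch.zeros(2), torch.ones(2)])
# Draw 5 initial candidates from a Sobol sequence instead of optimizing an
# acquisition function on a model that has not seen any data.
X_init = draw_sobol_samples(bounds=bounds, n=5, q=1).squeeze(-2)  # shape: 5 x 2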

AdrianSosic commented 2 weeks ago

Hi @esantorella, thanks for the quick answer ✌🏼. I'm aware of the conceptual problems that arise when working with empty training data, and I'm also clear about the difference between the two scalers, but let me reply point by point:

This isn't supported as of BoTorch 0.11.1; Standardize now raises an exception when given empty data. standardize isn't recommended; it's generally better to use an outcome transform than to standardize the data ahead of time, because the predictions will then be on the original scale. So I don't think there's a good way to do this with both standardization and empty data.

We should update the documentation to avoid references to standardize, and either deprecate it or clarify its behavior with empty data.

Thanks for the info, I now also saw your PR that implemented these changes. I think if you plan to discourage users from using standardize, I recommend updating the code on the landing page, which is the first point of contact with the package for many users πŸ™ƒ

As for behavior with empty data, the GP will give the same posterior distribution at any point, so an acquisition function like UCB will have the same value at every point, and optimize_acqf won't be able to make any progress. It will return something random.

To avoid relying on a model whose fit is likely poor, and to ensure diversity, it's common to generate the first five or so candidates using Sobol quasi-random points, e.g. with draw_sobol_samples.

I don't completely agree with this statement. While it is true for most situations, and in the average use case people are probably better off resorting to alternatives (like Sobol sampling), your claim does not hold in all scenarios:

What I'm trying to say here: I do get your point, but because of the way standardization is handled, it's currently impossible to create a single recommendation model that behaves consistently (i.e. applies the exact same logic) across all training data sizes. That is, if you created a plot of the "performance" of that model with the number of training data points on the x-axis, you would not be able to get a smooth curve that extends all the way to 0 on its left end, because that case currently requires applying a different model logic, even though conceptually the same logic could be applied, as my examples above demonstrate.

Perhaps I'm overlooking an important aspect here, so please feel free to correct me if I'm mistaken πŸ˜ƒ

esantorella commented 1 week ago

That all sounds right to me, with the proviso that in the batch case you'd need to use an acquisition function that supports q > 1, e.g. qUpperConfidenceBound rather than UpperConfidenceBound. (You may know this, but not everyone reading this will.) I especially agree that the documentation should be updated (PRs welcome!).
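For concreteness, a small sketch of the batch variant, reusing the gp and bounds from the snippet in the issue description (the numbers are arbitrary):

from botorch.acquisition.monte_carlo import qUpperConfidenceBound
from botorch.optim import optimize_acqf

# The Monte-Carlo version supports joint batches, unlike the analytic UpperConfidenceBound.
qUCB = qUpperConfidenceBound(gp, beta=0.1)
candidates, acq_value = optimize_acqf(
    qUCB,
    bounds=bounds,
    q=3,  # request a batch of three candidates
    num_restarts=5,
    raw_samples=20,
)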

There are two approaches we could go with here:

  1. Status quo: The Standardize transform can continue not to support the case of empty data. Users should simply not use a data-dependent transform like Standardize in this case. This keeps the behavior of BoTorch more transparent (because it's not obvious how empty data would be standardized).
  2. To facilitate a consistent API across data sizes, Standardize can support empty data. It would need to set some mean and standard deviation (say, 0 and 1) in order to be able to untransform predictions; a rough sketch of what this could look like follows below. This might make things a little easier on the user, but at the cost of more code complexity and non-obvious behavior within BoTorch.
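To make option 2 concrete, here is a rough, hypothetical sketch of the fallback statistics such a transform could use. This is plain tensor math, not the actual Standardize implementation, and the helper name fallback_standardize_stats is made up:

import torch

def fallback_standardize_stats(Y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical fallback mean/stdv for standardizing Y with fewer than 2 points.
    n = Y.shape[-2]
    if n == 0:
        # No data: standardization degenerates to the identity transform.
        mean = torch.zeros(*Y.shape[:-2], 1, Y.shape[-1])
        stdv = torch.ones_like(mean)
    elif n == 1:
        # One point: center on it, but fall back to a unit scale.
        mean = Y.mean(dim=-2, keepdim=True)
        stdv = torch.ones_like(mean)
    else:
        mean = Y.mean(dim=-2, keepdim=True)
        stdv = Y.std(dim=-2, keepdim=True)
    return mean, stdv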

Would it serve your use case to just not use a transform when data is empty?

AdrianSosic commented 1 week ago

Simply not transforming in the case of empty data would certainly work and would have been my manual workaround for it πŸ‘πŸΌ And I agree that there are two reasonable options, namely 1) to handle the ill-defined cases internally or 2) to raise errors.
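For completeness, a minimal sketch of that workaround; the helper name build_gp is made up, and it simply skips the transform when there are no observations:

import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms import Standardize

def build_gp(train_X: torch.Tensor, train_Y: torch.Tensor) -> SingleTaskGP:
    # Attach Standardize only when there is at least one observation;
    # with empty data the untransformed model simply falls back to its prior.
    if train_X.shape[0] > 0:
        return SingleTaskGP(
            train_X, train_Y, outcome_transform=Standardize(m=train_Y.shape[-1])
        )
    return SingleTaskGP(train_X, train_Y)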

However, I see a discrepancy between my expectation and the current implementation, and I think the current logic of Standardize is a somewhat weird/inconsistent mix of both alternatives.

To explain what I mean, let us consider the two scenarios that can happen:

The current version applies a manual fix for N=1 by setting the standard deviation to the arbitrary value of 1. But then, in the similarly ill-defined setting of N=0, where the same fix could be applied to the mean by setting it to 0 (since the decision "let's internally handle the breaking cases" has already been made), an error is thrown instead.

Long story short: to me it would seem much more consistent to either say "nope, not applying any custom logic here" for all cases N<2, or to fix all of them, because I don't see how fixing only the standard deviation is less invasive than fixing both.

esantorella commented 1 week ago

I agree with this, and found it confusing to understand what was happening in the N=1 and N=0 cases the last time I looked at the Standardize code. I'd be fine with either supporting neither (with a clear error message) or supporting both (with clear documentation for each case). I'm not sure I'll have time to do it myself promptly, but PRs are always welcome. :)

esantorella commented 1 week ago

On second thought, I think Standardize should be supported with one data point -- if model-fitting with one data point is supported at all -- because the priors don't make sense and model-fitting may struggle if the data isn't standardized. Here is an example where failing to use a transform results in a posterior mean that is even farther from zero than the data, which is pretty weird.

[Image: posterior comparison for the single-data-point example described above]

On the other hand, the second model seems highly overconfident in the presence of such large noise, so setting the standard deviation to 1 may be a bad choice in the presence of observed noise. It might be more sensible to set it to sqrt(yvar) or something, which would give a posterior that looks like this:

[Image: posterior when the fallback scale is set to sqrt(yvar)]
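As a rough illustration of that idea, a hypothetical helper that derives the fallback scale from the observed noise variances (train_Yvar) instead of using 1:

import torch

def fallback_stdv(train_Yvar: torch.Tensor) -> torch.Tensor:
    # With a single observation, fall back to the observed noise level rather
    # than an arbitrary standard deviation of 1.
    return train_Yvar.mean(dim=-2, keepdim=True).clamp_min(1e-12).sqrt()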

We could also conclude that we are not going to get reasonable predictions with 0 or 1 data points and disallow it entirely.

AdrianSosic commented 1 week ago

Hi @esantorella, thanks for creating the examples. Before we continue our discussion on whether or not to allow the degenerate cases, let's perhaps first figure out what's going wrong here!

First of all, thanks for bringing up the train_Yvar option – I actually didn't have this on my radar at all. Of course, that needs to be taken into account properly, and I also think that something like sqrt(yvar) would probably make sense. But the elephant in the room is definitely: how on earth can the posterior mean be above the data point in the unscaled case!? Clearly something is fishy here (or we are completely overlooking an important aspect)! Before we can continue to talk about scaling, I think we need to find the cause for that. Could it be that something in the fixed noise model is set up incorrectly?

esantorella commented 1 week ago

Yeah good question. I'm not going to have a chance to look closely into this right away, but my best guess would be that failing to standardize leads to difficulty with model fit, similar to #2392. The prior probability of having a data point at 1000 is very low, especially since the provided yvar nearly rules out the possibility that this is noise. So the marginal likelihood might be very flat, and tiny, near the optimum, causing numerical convergence troubles. Just a guess though!