**AdrianSosic** opened 2 weeks ago
This isn't supported as of BoTorch 0.11.1; `Standardize` now raises an exception when given empty data. `standardize` isn't recommended; it's generally better to use an outcome transform rather than standardize the data ahead of time, because then the predictions will be on the original scale. So I don't think there's a good way to do this with both standardization and empty data.

We should update the documentation to avoid references to `standardize`, and either deprecate it or clarify its behavior with empty data.
As for behavior with empty data, the GP will give the same posterior distribution at any point, so an acquisition function like UCB will have the same value at every point, and `optimize_acqf` won't be able to make any progress; it will return something random.

To avoid relying on a model whose fit is likely poor, and to ensure diversity, it's common to generate the first five or so candidates using Sobol quasi-random points, e.g. with `draw_sobol_samples`.
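For illustration, here is a minimal sketch of drawing such quasi-random initial points. It uses PyTorch's built-in `SobolEngine` directly rather than BoTorch's `draw_sobol_samples` convenience wrapper (which additionally rescales the points to arbitrary bounds); the 2-dimensional unit-cube search space is an assumption made for the example:

```python
import torch

# Quasi-random initial design: 5 Sobol points in a 2-d unit cube.
# These serve as the first candidates before any GP model is fit.
engine = torch.quasirandom.SobolEngine(dimension=2, scramble=True, seed=0)
X_init = engine.draw(5)  # tensor of shape (5, 2), values in [0, 1)
```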
Hi @esantorella, thanks for the quick answer! I'm aware of the conceptual problems that arise when working with empty training data, and I'm also clear about the difference between the two scalers, but let me reply point by point:
> This isn't supported as of BoTorch 0.11.1; `Standardize` now raises an exception when given empty data. `standardize` isn't recommended; it's generally better to use an outcome transform rather than standardize the data ahead of time, because then the predictions will be on the original scale. So I don't think there's a good way to do this with both standardization and empty data. We should update the documentation to avoid references to `standardize`, and either deprecate it or clarify its behavior with empty data.
Thanks for the info; I now also saw your PR that implemented these changes. If you plan to discourage users from using `standardize`, I recommend updating the code on the landing page, which is the first point of contact with the package for many users.
> As for behavior with empty data, the GP will give the same posterior distribution at any point, so an acquisition function like UCB will have the same value at every point, and `optimize_acqf` won't be able to make any progress; it will return something random. To avoid relying on a model whose fit is likely poor, and to ensure diversity, it's common to generate the first five or so candidates using Sobol quasi-random points, e.g. with `draw_sobol_samples`.
I don't completely agree with this statement. While this is true for most situations and, in the average use case, people are probably better off resorting to alternatives (like Sobol sampling), the claim does not hold in all scenarios: for example, when optimizing a batch with `sequential=True`, the same reasoning applies, with the exception that the first point of the batch would remain random.

What I'm trying to say here: I do get your point, but because of the way standardization is handled, it's currently impossible to create a single recommendation model that behaves consistently (i.e. applies the exact same logic) across all training data sizes. That is, if you created a plot of the "performance" of that model where the x-axis shows the number of training data points, you would not be able to get a smooth curve that fully extends to 0 on its left end, because that case currently requires applying a different model logic, even though conceptually the same logic could be applied, as my examples above demonstrate.

Perhaps I'm overlooking an important aspect here, so please feel free to correct me if I'm mistaken.
That all sounds right to me, with the proviso that you'd need to use an acquisition function that supports q > 1 in the batch case, e.g. `qUpperConfidenceBound` rather than `UpperConfidenceBound`. (You may know this, but not everyone else reading this will.) I especially agree that the documentation should be updated (PRs welcome!).
There are two approaches we could go with here:

1. The `Standardize` transform can continue to not support the case of empty data. Users should just not use a data-dependent transform like `Standardize` in this case. This makes the behavior of BoTorch more transparent (because it's not obvious how empty data would be standardized).
2. `Standardize` can support empty data. It will need to set some mean and standard deviation (say, to 0 and 1) in order to be able to untransform predictions. This might make things a little easier on the user, but at the cost of more code complexity and nonobvious behavior within BoTorch.

Would it serve your use case to just not use a transform when data is empty?
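The second option amounts to a graceful-degradation rule for the standardization parameters. A plain-Python sketch of the idea (the helper name and the fallback values are hypothetical, not BoTorch API):

```python
def standardization_params(y):
    """Pick (mean, std) for standardizing outcomes, degrading gracefully:
    no data -> identity transform; one point -> center only.
    Hypothetical helper, not BoTorch API."""
    n = len(y)
    if n == 0:
        return 0.0, 1.0  # identity transform: leave the data untouched
    mean = sum(y) / n
    if n == 1:
        return mean, 1.0  # std is undefined for a single point; fall back to 1
    var = sum((v - mean) ** 2 for v in y) / (n - 1)  # Bessel-corrected
    return mean, var ** 0.5
```

With such a rule, the same code path would handle every training-set size uniformly.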
Simply not transforming in the case of empty data would certainly work and would have been my manual workaround for it. And I agree that there are two reasonable options, namely 1) to handle the ill-defined cases internally or 2) to raise errors.

However, I see a discrepancy between my expectation and the current implementation, and I think the current logic of `Standardize` is a somewhat weird/inconsistent mix of both alternatives.
To explain what I mean, let us consider the two scenarios that can happen:
- `N >= 2`: everything is fine; no need for custom logic or errors.
- `N = 1`: the mean estimate is still intact, but the Bessel-corrected standard deviation estimate breaks. For `N = 0`, both estimates break.

The current version applies a manual fix for `N = 1` by setting the standard deviation to the arbitrary value of 1. But then, in the similarly ill-defined setting of `N = 0`, where the same fix could be applied to the mean by setting it to 0 (since the decision "let's internally handle the breaking cases" has already been made), an error is thrown instead.
Long story short: to me it would seem much more consistent to either say "nope, not applying any custom logic here" for all cases `N < 2`, or to fix all of them, because I don't see how fixing only the standard deviation is less invasive than fixing both.
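The asymmetry is easy to demonstrate with Python's standard library alone: for a single observation the mean is perfectly well-defined, while the Bessel-corrected standard deviation (which divides by N - 1) is not:

```python
import statistics

# N = 1: the mean estimate is intact ...
assert statistics.mean([5.0]) == 5.0

# ... but the Bessel-corrected standard deviation breaks (divides by N - 1 = 0).
try:
    statistics.stdev([5.0])
except statistics.StatisticsError as exc:
    print(exc)  # stdev requires at least two data points

# N = 0: both estimates break.
try:
    statistics.mean([])
except statistics.StatisticsError as exc:
    print(exc)  # mean requires at least one data point
```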
I agree with this, and I found it confusing to understand what was happening in the `N = 1` and `N = 0` cases the last time I looked at the `Standardize` code. I'd be fine with supporting neither (with a clear error message) or both (with clear documentation on each case). I'm not sure I'll have time to do it myself promptly, but PRs are always welcome. :)
On second thought, I think `Standardize` should be supported with one data point (if model fitting with one data point is supported at all), because the priors don't make sense and model fitting may struggle if the data isn't standardized. Here is an example where failing to use a transform results in a posterior mean that is even farther from zero than the data, which is pretty weird.
On the other hand, the second model seems highly overconfident in the presence of such large noise, so setting the standard deviation to 1 may be a bad choice in the presence of observed noise. It might be more sensible to set it to `sqrt(yvar)` or something, which would give a posterior that looks like this:
We could also conclude that we are not going to get reasonable predictions with 0 or 1 data points and disallow it entirely.
Hi @esantorella, thanks for creating the examples. Before we continue our discussion on whether or not to allow the degenerate cases, let's perhaps first figure out what's going wrong here!

First of all, thanks for bringing up the `train_Yvar` option; I actually didn't have this on my radar at all. Of course, that needs to be taken into account properly, and I also think that something like `sqrt(yvar)` would probably make sense. But the elephant in the room is definitely: how on earth can the posterior mean lie above the data point in the unscaled case!? Clearly something is fishy here (or we are completely overlooking an important aspect). Before we can continue to talk about scaling, I think we need to find the cause for that. Could it be that something in the fixed-noise model is set up incorrectly?
Yeah, good question. I'm not going to have a chance to look closely into this right away, but my best guess would be that failing to standardize leads to difficulty with the model fit, similar to #2392. The prior probability of having a data point at 1000 is very low, especially since the provided `yvar` nearly rules out the possibility that this is noise. So the marginal likelihood might be very flat, and tiny, near the optimum, causing numerical convergence trouble. Just a guess, though!
🐛 Bug
I am currently playing around with situations where there's no training data available yet and noticed that the behaviors of `utils.standardize` and `transforms.Standardize` are inconsistent.

To reproduce

Here is a minimal example adapted from the code on the landing page. When you run the original code for the GP creation (in the comments), everything works fine. However, when you run the displayed version, you get the error shown below.
Code snippet to reproduce
Stack trace/error message
Expected Behavior
In both cases, the resulting GP posteriors should simply match, and the standardization applied along the way should not mess with the computation. I haven't checked what exact logic is applied internally when there is no data / only one data point being passed to the transformation, but my intuition tells me that what should happen is:
System information
Please complete the following information: