pytorch / botorch

Bayesian optimization in PyTorch
https://botorch.org/
MIT License
3k stars 383 forks

Classification Model as Output Constraint #725

Open jduerholt opened 3 years ago

jduerholt commented 3 years ago

Dear botorch developers,

I have a question regarding output constraints. So far they are used and implemented in the following way:
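To make the pattern concrete, here is a minimal sketch in plain Python of the sigmoid-smoothed constraint weighting (assuming the usual botorch convention that a constraint callable returns a value $c(x)$ with $c(x) \leq 0$ meaning feasible; function names are illustrative, not the actual API):

```python
import math

def smoothed_feasibility(c_value: float, eta: float = 0.1) -> float:
    # Sigmoid relaxation of the feasibility indicator 1[c(x) <= 0]:
    # approaches 1 for clearly feasible points, 0 for clearly infeasible ones.
    return 1.0 / (1.0 + math.exp(c_value / eta))

def constrained_objective(obj_value: float, c_value: float, eta: float = 0.1) -> float:
    # The objective sample is weighted by the smoothed feasibility indicator,
    # so infeasible candidates contribute (almost) nothing.
    return obj_value * smoothed_feasibility(c_value, eta)
```

The temperature `eta` controls how sharply the weight transitions from 1 to 0 around the constraint boundary.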

Now comes my question:

How does one set up an optimization in which a "real" classification model is needed? Imagine a chemical experiment in which only certain input combinations lead to a "successful" experiment. Successful in this context means that it is possible to measure the properties that should be optimized. For the failed experiments, no outputs are available, and within the BO loop one wants to propose only experiments that lead to successful experiments.

I found one paper in this context. They use an in-the-loop classifier to judge whether an experiment will be successful or not. I think there should be better solutions using the output-constraint technique described above. One could, for example, directly train a classification model using gpytorch and multiply its output with the actual objective.

What are your thoughts on this and have you ever tried something related within botorch?

Best,

Johannes

Balandat commented 3 years ago

Hi @jduerholt, great question.

Using a probit classification model with a Bernoulli likelihood for actual classification should work in this context. You're right that, since such a model maps the latent function directly to a probability of success, one should be able to use the same approach as with the existing constrained formulation, but using this success probability directly (the relevant likelihood is here: https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/likelihoods/bernoulli_likelihood.py#L26).
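The probit link itself is just the standard normal CDF. A quick sketch, in plain Python, of how a latent GP value would map to a success probability (illustrative only, not the gpytorch implementation):

```python
import math

def probit(f: float) -> float:
    # Bernoulli likelihood with a probit link: p(success | x) = Phi(f(x)),
    # where f is the latent GP function and Phi is the standard normal CDF.
    return 0.5 * (1.0 + math.erf(f / math.sqrt(2.0)))
```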

We could probably just use the vanilla GP regression model from https://github.com/cornellius-gp/gpytorch/blob/master/examples/04_Variational_and_Approximate_GPs/Non_Gaussian_Likelihoods.ipynb to prototype something (note that this uses variational inference). The main reason we haven't tried this out is that we haven't run into any concrete use cases in our work so far, but it would be interesting to try.

I recall reading a paper that did this within the last year or two, but I can't find it right now.

An interesting question is whether one would want to use a multi-task model for modeling the latent constraint function and the objective(s), the rationale being that performance and feasibility may be quite correlated. But that's a refinement and not necessary as a first step.

jduerholt commented 3 years ago

Hi @Balandat,

thanks for the quick response. A prototype based on a single-task model would be sufficient to start with; if this works properly, one could extend it to a multi-task model, which could boost performance further.

Would you do the prototyping, or is it too far from the typical Facebook use cases?

Best,

Johannes

Balandat commented 3 years ago

I can throw something together in a notebook on some toy problem, kind of curious myself how this works. Might take a bit to get this cleaned up and make it conveniently usable though.

Balandat commented 3 years ago

Made some initial strides. Making a BoTorch model based on the simple classification GP is easy; hooking things up into the model API will require a bit of thinking, since it wasn't initially designed for classification models. We can probably (i) implement a Posterior that generically represents probability values, and then (ii) add some abstraction that makes it easy to run BO for some outcome model (with a standard posterior) subject to one or more constraint models (with the new probability posterior). This should at least cover the case of modeling the outcome and the feasibility independently.

jduerholt commented 3 years ago

Hi @Balandat,

this sounds like a plan. Modeling outcome and feasibility independently looks like the key to me.

Another possibility for certain cases could be to take the X values in the objective and run them through a simple classifier (like logistic regression) at that point. But this somehow violates the way botorch is designed.

Thanks for your efforts. If we get it running properly, from my perspective this could also be something publishable, since I have not found much on this topic.

Best,

Johannes

Balandat commented 3 years ago

wondering if @bletham or @mshvartsman have any thoughts here, as they were working on similar / related things.

bletham commented 3 years ago

Yeah ultimately in constrained EI the constraint shows up in the acquisition function as Probability(x is feasible). In the usual case where constraints are continuous valued and observed this probability can be computed from a GP, but it could just as easily be a probability from a logistic regression. The challenge is just differentiability and being able to backprop through that Prob(feasible).
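For concreteness, here is the classic constrained-EI decomposition, EI(x) * Prob(x is feasible), sketched in plain Python; the feasibility probability can come from any classifier, GP-based or not (a sketch only, not botorch's implementation):

```python
import math

def _phi(z: float) -> float:
    # standard normal pdf
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z: float) -> float:
    # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def constrained_ei(mu: float, sigma: float, best_f: float, p_feasible: float) -> float:
    # Analytic expected improvement for a Gaussian posterior with mean mu
    # and stddev sigma, scaled by the probability of feasibility at x.
    z = (mu - best_f) / sigma
    ei = sigma * (z * _Phi(z) + _phi(z))
    return ei * p_feasible
```

The backprop difficulty mentioned above is exactly in the `p_feasible` factor: if it comes from a non-differentiable classifier, gradients cannot flow through it.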

My only comment would be that we've done something along these lines with a GP classifier, in which a GP is used to model the latent probability and then gets a Bernoulli likelihood on top. That doesn't work well in a lot of practical settings, because the GP will assume smoothness in the latent fail/success probabilities, and in real problems there are often hard edges where a small change in parameters can take it from failing every time to succeeding every time. So something like an SVM or random forest classifier tends to work better for those types of problems. But then you can't get the backprop inside BoTorch.

Balandat commented 3 years ago

@jduerholt, curious if you have any intuition as to whether in your case there is reason to expect / not expect smoothness in the probability of feasibility.

jduerholt commented 3 years ago

Thanks for the comments @bletham. @Balandat, I will take some time next week to investigate which type of classifier fits the problem best and whether one can expect smoothness. When I have the answer, I will come back to you and we can discuss what the best solution is.

Have a nice weekend!

Best,

Johannes

mshvartsman commented 3 years ago

I probably don't have as much experience as @bletham with constrained BO of this flavor, so consider all of this with a grain of salt, but:

  1. You don't have to use an RBF kernel, right? Something like Matern might be more reasonable if you're not expecting as much smoothness.
  2. I think that while formally you may have smoothness in f(x), with some creativity with the outputscale prior you may be able to get pretty sharp cutpoints numerically on the probability phi(f(x)). In the limit where the scale on f(x) is very large, phi(f(x)) will always be 0 or 1. You could also try other transformations before the probit to really jam the probability towards 0 and 1 faster if you think that's necessary.
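Point 2 can be seen numerically: scaling the same latent value by an increasingly large outputscale pushes the probit probability towards 0 or 1 (a toy illustration in plain Python):

```python
import math

def probit(f: float) -> float:
    # probit link: standard normal CDF of the latent value
    return 0.5 * (1.0 + math.erf(f / math.sqrt(2.0)))

# the same latent value f = 0.5 under increasing outputscales:
# the resulting probabilities sharpen towards 1 as the scale grows
probs = [probit(scale * 0.5) for scale in (1.0, 5.0, 20.0)]
```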

I'd probably try getting the differentiable GP classifier to work first even if you have doubts about smoothness, before going down the harder path of non-differentiable constraint modeling.

bletham commented 3 years ago

The issue we ran into when trying to do something very similar to this was that a sharp transition from p=0 to p=1 (always fails to always succeeds) forced the GP to fit a very short lengthscale, which then meant it couldn't generalize. But I agree that there are GP formulations (perhaps with a non-stationary kernel) that probably could have worked.

mshvartsman commented 3 years ago

Good point! The setting where we've seen reasonable behavior wasn't stationary, strictly speaking. Non-stationary kernel or input warping seems like it might be useful if one were to try this.

Balandat commented 2 years ago

Adding @dme65 and @jtwilson who have been making strides towards this recently.

JoseValenR commented 2 years ago

Hi all, I was also looking for handling experiment failures and came across this issue. Is there any plan for implementing it?

Balandat commented 2 years ago

Yeah, we're working on implementing this with a new probabilistic classification model. I don't have a timeline for open-sourcing this yet, but hopefully we'll have something preliminary ready in a couple of months or so.

If you need something quicker it may make sense to try to hook something like this up: https://github.com/cornellius-gp/gpytorch/blob/master/examples/04_Variational_and_Approximate_GPs/PolyaGamma_Binary_Classification.ipynb

JoseValenR commented 2 years ago

Looking forward to it... in the meantime I'll look into the example provided.

jduerholt commented 2 years ago

Same here, looking forward to it. This will be very useful for some use cases ;)

jduerholt commented 6 months ago

Coming back to this issue after a long time.

Let's assume that we have an arbitrary differentiable classification model $g$, which, as a function of $x$, returns a vector of class probabilities (summing to 1). Furthermore, we have a bit vector of the same length which encodes the desirable class for the constrained optimization.

An example: $g(x) = [0.8, 0.1, 0.1]$ and the desirability vector is $d = [1, 0, 0]$, which means that the only desirable outcome is class 1. Then the overall desirable probability is given by the dot product of the two vectors, $p = g(x) d^T$. We want to use this probability directly as an outcome constraint in an optimization. To do so, one has to somehow invert the sigmoid operation; this can be achieved by pushing $p$ through $\log(1/p - 1)$. When the sigmoid is then applied to this term, the result is an identity operation for $0 < p < 1$. For the edge cases of $p$ approaching 0 or 1, one needs to introduce a small numeric offset, which is neglected here.
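The round trip can be sketched in plain Python (the clamping constant `eps` plays the role of the numeric offset; the sign convention assumes the smoothing step applies sigmoid to $-c/\eta$ with $\eta = 1$, so that $c \leq 0$ means feasible):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def to_constraint(p: float, eps: float = 1e-8) -> float:
    # invert the sigmoid: c = log(1/p - 1) = -logit(p);
    # eps guards the edge cases p -> 0 and p -> 1
    p = min(max(p, eps), 1.0 - eps)
    return math.log(1.0 / p - 1.0)

g_x = [0.8, 0.1, 0.1]   # class probabilities from the classifier
d = [1, 0, 0]           # desirability vector: only class 1 is desired
p = sum(gi * di for gi, di in zip(g_x, d))   # dot product, p = 0.8

# smoothing with sigmoid(-c / eta), eta = 1, recovers p (up to eps)
recovered = sigmoid(-to_constraint(p))
```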

From my perspective, this is a lot of overhead for an identity operation, as we want to use the probability $p$ directly for the output constraint. So my question is: do you have an idea how to do this more elegantly, or how best to implement this identity constraint handling directly in botorch?

cc @gmancino

jduerholt commented 6 months ago

@Balandat: any thoughts on this?

Balandat commented 6 months ago

Sorry for the delay, busy times...

We want to use this probability directly as an outcome constraint in an optimization. To do so, one has to somehow invert the sigmoid operation

Hmm I am not sure I understand this part. Why do you need to invert the sigmoid operation?

jduerholt commented 6 months ago

Hmm, perhaps this was unclear. Let me try again ;)

The desired probability $p$ should be used directly as the output constraint: when it is 0, the constraint is not fulfilled, and when it is 1, it is fulfilled. However, botorch always applies a sigmoid operation to an output constraint. That is totally fine for regression, as the sigmoid transforms the regression problem into a kind of classification problem (as in logistic regression) that separates the regions where the constraint is fulfilled from those where it is not.

In our case, I already have this probability and do not need any sigmoid operation. For this reason, one can transform the desired probability $p$ as shown above before the sigmoid is applied, which then results in an identity operation, but at the cost of extra mathematical operations and the risk of numerical instabilities.

Hope this makes it clearer...

Balandat commented 6 months ago

I think so? I guess my first idea here would be to just extract the latent (un-sigmoided) model prediction and pass that in as the outcome, together with the transformed bound. One way of doing that could be to just wrap the model. I'm not sure exactly what classification model you're using; if you have a pointer, that would be helpful.