patrick-kidger / optimistix

Nonlinear optimisation (root-finding, least squares, ...) in JAX+Equinox. https://docs.kidger.site/optimistix/
Apache License 2.0

Draft: Implement box constraints for the Nelder-Mead solver, to increase its robustness on messy data where bounds are known. #64

Open johannahaffner opened 3 months ago

johannahaffner commented 3 months ago

I'm using Nelder-Mead to fit a variance model to residuals. On real data, the solve diverges frequently. I followed the advice in #45 and implemented bounds in the options passed to the solver.

This is my first-ever pull request, and I would not be surprised if there is a much more elegant way to solve this, but I wanted to get your opinion early on in the process.

johannahaffner commented 3 months ago

PS: I have not touched the stats, but it might be useful to track how often the vertices get clipped to the bounds.

johannahaffner commented 3 months ago

All great points, thank you! I have a pretty busy week ahead, will sit down as I am able to and think this through :)

johannahaffner commented 3 months ago

Hi! I managed to reformulate the objective function and now it works super well even without bounds (and using a different solver). However, I would really like to be able to set bounds in LevenbergMarquardt, and potentially in other solvers too.

Do you think bounded optimization should be a per-solver implementation, or could there be a more abstract formulation that works on any pytree, and could be added to any solver?

FFroehlich commented 3 months ago

Doing this properly for trust region methods is a bit more involved. In my experience, https://epubs.siam.org/doi/10.1137/0806023 is the best approach. We did some benchmarking in https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010322, which only looks at reflective strategies, not truncations. Fides also implements truncations, and from what I recall these were either worse than or as good as the reflective strategies.

patrick-kidger commented 3 months ago

Do you think bounded optimization should be a per-solver implementation, or could there be a more abstract formulation that works on any pytree, and could be added to any solver?

Honestly, I'm not sure! Constrained optimization was out-of-scope when we first wrote Optimistix, and we've not revisited it since.

If there can be a general approach to it then that would be very useful. I suspect there probably is something -- if nothing else then truncating/reflecting at the end of every step is a thing you should always be able to do!

@FFroehlich -- that's super interesting to read! If we can support multiple strategies that may be best...

johannahaffner commented 3 months ago

Hi @FFroehlich and @patrick-kidger,

thank you for your input, and thank you for the references! Very interesting indeed. It makes sense that reflection should be better - seems like it has a higher chance of ending up somewhere where the direction of descent does not immediately lead back to the bounds. I'd love to take this up and implement support for bounded optimization.

Here are some thoughts:

I would define tree_clip and tree_reflect functions; a rough sketch follows after these points.

Very minor point on notation: I prefer clip over truncate, to make it more obvious that jnp.clip is used under the hood. If that veers too far from convention, let's go with truncate.

In terms of tree structures, is there anything other than vectors and parametric models that I need to consider? So far I have only optimized over those. (The parametric model is a small linear model in my case, implemented as an eqx.Module with a __call__ method.) I haven't used anything that has, for example, an activation function inside.
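Something like the following, as a first sketch (the single-bounce reflection is just one reasonable reading of "reflect", nothing settled):

import jax.numpy as jnp
import jax.tree_util as jtu


def tree_clip(y, lower, upper):
    # Clip every leaf of y into [lower, upper]; lower/upper are pytrees
    # matching the structure of y (or broadcastable leaves).
    return jtu.tree_map(jnp.clip, y, lower, upper)


def _reflect(leaf, lo, hi):
    # Mirror an overshoot back into the box once, then clip in case the
    # step overshot by more than the box width.
    leaf = jnp.where(leaf > hi, 2 * hi - leaf, leaf)
    leaf = jnp.where(leaf < lo, 2 * lo - leaf, leaf)
    return jnp.clip(leaf, lo, hi)


def tree_reflect(y, lower, upper):
    return jtu.tree_map(_reflect, y, lower, upper)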

Enjoy your days!

patrick-kidger commented 3 months ago

In terms of API, I'd probably suggest making it a callable all the way, and supporting nothing else.

That would mean something like optx.minimise(..., move_y=...) -- although probably let's pick a different name -- where in particular move_y is actually a function that maps points from where the optimiser put them, to where they should be instead.

We can then also define helper functions like tree_clip and tree_reflect, which can be passed to this API!

Agreed on 'clip' over 'truncate'. On tree structures: we require that y0 be of type PyTree[Array] (check its type signature), so you don't need to worry about any non-arrays like activation functions.
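To make that concrete, hypothetical usage might look like the following (move_y is only the placeholder name from this thread, not an existing optimistix argument, and tree_clip is the helper sketched earlier):

import jax.numpy as jnp
import optimistix as optx


def loss(y, args):
    return jnp.sum((y - 2.0) ** 2)


lower, upper = jnp.zeros(3), jnp.ones(3)
solver = optx.NelderMead(rtol=1e-6, atol=1e-6)
# Hypothetical, if such an argument existed:
# sol = optx.minimise(
#     loss, solver, y0=jnp.full((3,), 0.5),
#     move_y=lambda y: tree_clip(y, lower, upper),
# )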

johannahaffner commented 3 months ago

Great idea, a callable would make it much cleaner! Then stuff like checking if lower and upper are each defined can be handled inside of that, and the solvers themselves can remain blissfully ignorant of all the details, only passing a point and getting one back.

I'd propose something like this:

import abc

import equinox as eqx
from jaxtyping import Array, PyTree


class AbstractBoundaryMap(eqx.Module):  # eqx.Module base is an assumption
    ...

    @abc.abstractmethod
    def _map_with_bounds(self, y):
        """Very good description."""

    def __call__(self, y: PyTree[Array]) -> PyTree[Array]:
        return self._map_with_bounds(y)


class ClippingBoundaryMap(AbstractBoundaryMap):
    ...


class ReflectiveBoundaryMap(AbstractBoundaryMap):
    ...

How about calling the callable boundary_map and having it default to None? The user would then have to define it outside of the solve, passing lower and upper similarly to how tolerances have to be specified in diffrax events.
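For example (just a sketch - the eqx.Module base and field layout are assumptions, not settled API):

import jax.numpy as jnp
import jax.tree_util as jtu


class ClippingBoundaryMap(AbstractBoundaryMap):
    lower: PyTree[Array]
    upper: PyTree[Array]

    def _map_with_bounds(self, y):
        # Clip each leaf of y back into [lower, upper].
        return jtu.tree_map(jnp.clip, y, self.lower, self.upper)


# Defined outside the solve, then passed to it:
boundary_map = ClippingBoundaryMap(lower=-jnp.ones(3), upper=jnp.ones(3))
y_adjusted = boundary_map(2.5 * jnp.ones(3))  # clipped back to ones(3)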

patrick-kidger commented 3 months ago

That sounds good to me! Why have separate _map_with_bounds and __call__, though? (Just make the __call__ an abstractmethod.)

@FFroehlich what do you think?

One thing I'm not sure on is how far we want to go with this. The make-a-step-and-then-adjust-it approach is reasonable for simple problems, but in general constrained optimisation is a much larger topic than this. If we later decide that we want to implement e.g. interior point methods, is that a thing we would do without having painted ourselves into a corner here?

johannahaffner commented 3 months ago

That sounds good to me! Why have separate _map_with_bounds and __call__, though? (Just make the __call__ an abstractmethod.)

Ah right! I will do away with the private method.

And I was indeed focused on the more immediate problem of modifying the proposed y. From the abstract of the paper you shared, @FFroehlich:

Instead, a solution to a trust region subproblem is defined by minimizing a quadratic function subject only to an ellipsoidal constraint.

That does not sound like a boundary map to me; it sounds like a solver-specific option again, called in step during the computation of the next point rather than after. We could then handle it on a per-solver basis.

All solvers in which simple constraints are reasonable could then support these, with a shared implementation thereof. I can commit to implementing the boundary maps for clipped and reflected adjusted steps.

FFroehlich commented 3 months ago

The paper I sent uses a two-pronged approach.

On the one hand, there is the step-then-project/adjust approach to make sure that constraints are always satisfied. I agree with everything discussed/implemented so far, and also that this is likely the best approach for simple problems.

On the other hand, there is the change to the trust region subproblem. What happens is that the boundary constraint also adds a transformation of the optimisation variables, which effectively results in a damping term in the quadratic subproblem, similar to what happens with the interior point approach (although you don't have any slack variables). This has the advantage that it also adds some mild regularisation to the problem, which is likely to be helpful for ill-conditioned problems, but the implementation will be a bit more complex (though I can see how this could be implemented rather broadly, by passing a (pre-conditioning) variable transformation).
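Roughly, the scaling is built from the distance to the bounds along the gradient direction. A flat-array sketch in the style of Coleman & Li (exact conventions differ between implementations, so treat this as illustrative only):

import jax.numpy as jnp


def coleman_li_v(y, grad, lower, upper):
    # Distance-to-bounds scaling vector v(y); the trust-region subproblem
    # is then solved in variables rescaled by sqrt(|v|), which is where
    # the damping/regularisation mentioned above comes from.
    v = jnp.ones_like(y)
    v = jnp.where((grad < 0) & jnp.isfinite(upper), upper - y, v)
    v = jnp.where((grad > 0) & jnp.isfinite(lower), y - lower, v)
    return v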

johannahaffner commented 3 months ago

Thank you for pointing that out, @FFroehlich. I gave the paper introducing Fides a deep read now :)

I think the required transformation of the optimization variables and the Hessian could be an extra method of a BoundaryMap, that is simply ignored by non-trust region solvers. I have yet to write this down in code or pseudocode and consider what would be required for interior-point methods with slack variables.

However, I also noticed this

To ensure convergence, $\Delta \theta_k$ is then selected based on the lowest $m_k(p)$ value among (i) the reflection of $p^*$ at the parameter boundary, (ii) the constrained Cauchy step, which is the minimizer of $m_k(p)$ along the gradient that is truncated at the parameter boundary, and (iii) $p^*$ truncated at the parameter boundary.

In each case,

  • $m_k(p) = f_k + \nabla f_k^T p + \frac{1}{2} p^T \nabla^2 f_k \, p$ for an objective function $f$ evaluated at $\theta_k$.
  • $p^*$ minimizes $m_k$ s.t. $\|p\| \leq \Delta_k$, with $\Delta_k$ being the trust region radius.

This sounds to me as though the solver checks i), ii) and iii) and then picks the best option. This would be an argument against separate BoundaryMaps for clipping and reflective boundaries and in favor of a single map that could implement both (+ the Cauchy, eventually).
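In pseudocode-ish form, a single map could do something like this (flat arrays, mk being the quadratic model; all names illustrative):

import jax
import jax.numpy as jnp


def select_step(mk, p_reflected, p_cauchy, p_clipped):
    # Evaluate the quadratic model at each candidate step and keep the
    # lowest, mirroring the selection rule quoted above.
    candidates = jnp.stack([p_reflected, p_cauchy, p_clipped])
    values = jax.vmap(mk)(candidates)
    return candidates[jnp.argmin(values)]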

Have you checked how i), ii) and iii) perform separately? I'm not sure how to interpret Fig. 3D and E from the Fides paper.

Do you expect strong performance differences between the solver picking among the three options, compared to sticking to a single option during a solve?

FFroehlich commented 3 months ago

Thank you for pointing that out, @FFroehlich. I gave the paper introducing Fides a deep read now :)

I think the required transformation of the optimization variables and the Hessian could be an extra method of a BoundaryMap, that is simply ignored by non-trust region solvers. I have yet to write this down in code or pseudocode and consider what would be required for interior-point methods with slack variables.

Yes, that sounds reasonable.

However, I also noticed this

To ensure convergence, $\Delta \theta_k$ is then selected based on the lowest $m_k(p)$ value among (i) the reflection of $p^*$ at the parameter boundary, (ii) the constrained Cauchy step, which is the minimizer of $m_k(p)$ along the gradient that is truncated at the parameter boundary, and (iii) $p^*$ truncated at the parameter boundary.

In each case,

  • $m_k(p) = f_k + \nabla f_k^T p + \frac{1}{2} p^T \nabla^2 f_k \, p$ for an objective function $f$ evaluated at $\theta_k$.
  • $p^*$ minimizes $m_k$ s.t. $\|p\| \leq \Delta_k$, with $\Delta_k$ being the trust region radius.

This sounds to me as though the solver checks i), ii) and iii) and then picks the best option. This would be an argument against separate BoundaryMaps for clipping and reflective boundaries and in favor of a single map that could implement both (+ the Cauchy, eventually).

Have you checked how i), ii) and iii) perform separately? I'm not sure how to interpret Fig. 3D and E from the Fides paper.

Do you expect strong performance differences between the solver picking among the three options, compared to sticking to a single option during a solve?

That's where things get more complicated. It's probably helpful to read https://link.springer.com/article/10.1007/BF01582221 carefully. From what I recall, they include the (truncated) gradient step ((ii) above) because it is necessary to prove convergence. In contrast to (iii), which rescales all elements of $p^*$, the truncation strategy in Fig. 3 D/E is actually an element-wise truncation at the boundary (replacing (i)), applied just to the parameters that hit the boundary. I don't think there is any theoretical justification for including (iii); it is more of a practical matter, since the corresponding step is computed anyway.
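To illustrate the difference between the two flavours of truncation (flat-array sketch, not taken from any of the implementations above):

import jax.numpy as jnp


def truncate_elementwise(y, p, lower, upper):
    # Clip only the coordinates of y + p that leave the box; this changes
    # the direction of the step (the Fig. 3 D/E strategy).
    return jnp.clip(y + p, lower, upper) - y


def truncate_rescale(y, p, lower, upper):
    # Shrink the whole step by a single factor so that y + alpha * p stays
    # inside the box; this preserves the direction, as in (iii).
    ratios = jnp.where(p > 0, (upper - y) / p,
                       jnp.where(p < 0, (lower - y) / p, jnp.inf))
    alpha = jnp.minimum(1.0, jnp.min(ratios))
    return alpha * p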

We simply stuck with the implementation that fmincon/lsqnonlin and ls_trf were using, in order to be able to compare to those methods. My impression is that overall there are a lot of empirical choices in these algorithms, with little evidence that they are in any way optimal. Also, contemporary optimisation problems likely have different characteristics from the typical benchmarks of the 90s. You could try to use just a single option. You could try to use only a reflective/truncation strategy. I would expect both to work reasonably well, as optimisation close to boundaries is just generally difficult, even with the discussed strategies.

If you are interested in trying some things out, we would have everything in place to easily run optimistix on the PEtab benchmark collection.

johannahaffner commented 3 months ago

If you are interested in trying some things out, we would have everything in place to easily run optimistix on the PEtab benchmark collection.

100 %! Amazing, I didn't know that interfacing with the PEtab benchmarks was already possible.

I'll update with more detail on your other remarks, once I have been able to give them thorough consideration. Thank you for the thoughtful reply!

johannahaffner commented 2 months ago

Just a brief update: I have not forgotten about this :)

I did read the papers you sent, @FFroehlich. I then branched out into more general literature on constrained optimisation, consulting Nocedal & Wright in particular. I took that route because box constraints are just a special case of inequality constraints, and if we don't want to box ourselves in (pun intended), I think it is useful to have a bit of an overview.

I have some things to wrap up in the coming two weeks, but plan to return to this topic then - I think there is potential for some useful abstractions here, since a lot of common ingredients are repeated across different approaches. (I do agree that it seems as though people make some empirical choices there - and then run with them if convergence can be proven. Optimality of one approach among many is much harder to show, as @FFroehlich has already pointed out.)