patrick-kidger / optimistix

Nonlinear optimisation (root-finding, least squares, ...) in JAX+Equinox. https://docs.kidger.site/optimistix/
Apache License 2.0

New line searches #1

Open patrick-kidger opened 9 months ago

patrick-kidger commented 9 months ago
HeSchatz commented 3 months ago

Hey there. I gave implementing the zoom line search algorithm a try, as I have found it to be a reliable method, at least for some of the problems I have tackled in the past.

The main obstacle (apart from getting used to JAX control flow for nested if-expressions) is how to retrieve the descent information needed to evaluate the decrease and curvature conditions, as well as to interpolate the step sizes. In the current implementation of the search and descent interfaces there seems to be no general way to retrieve the descent direction after the descent step is made. Also, the descent methods currently have no common naming scheme for the descent direction, if it is even calculated explicitly (I get that this may not always be necessary), which further complicates things. My understanding is that one would have to divide the current step by the step size to (re-)evaluate the descent direction, which seems like a bad idea for a quantity that is supposed to shrink as the solver converges. Or am I missing something obvious?

If not, my proposal would be to modify the interface of the descent methods so that the current descent direction can be retrieved in a unified way. I also think the abstract line search step method should take an additional argument for the descent direction, or even the descent state, as more sophisticated line search methods are likely to need one or the other.

On a side note, I'm also working on implementing the L-BFGS algorithm (kind of finished, at least from a methodological point of view), which also relies on common interfaces for descents.

Looking forward to discussing this topic further!

packquickly commented 3 months ago

Hello! Thanks for the interest in implementing some new line searches.

Just to be clear: for a minimisation problem of a function $f\colon \mathbb{R}^n \to \mathbb{R}$, an Optimistix descent $\delta\colon \mathbb{R}_+ \to \mathbb{R}^n$ is a curve in $\mathbb{R}^n$ parameterised by the step-size. Line searches handle the inner optimisation loop of choosing $\alpha$ such that $f(y_k + \delta(\alpha))$ is approximately minimised with respect to $\alpha$. We only recover the classical line-search problem of minimising $f(y_k + \alpha p_k)$ over $\alpha$ for a fixed descent direction $p_k \in \mathbb{R}^n$ when using certain descents, such as NewtonDescent and SteepestDescent. We don't have a common name/interface for the descent direction $p_k$ because most of the descents don't compute a quantity like $p_k$ at all, so it's intentionally not supported. It also shouldn't be needed for the Wolfe conditions (or related conditions).
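
To make the distinction concrete, here is a toy sketch (plain JAX on a quadratic, nothing to do with Optimistix internals, and the damping parameterisation below is purely illustrative) of two descents viewed as curves $\delta(\alpha)$: for steepest descent the curve is a straight line through a fixed $p_k$, whereas for a Levenberg–Marquardt-style damped descent the direction itself changes with the step-size, so there is no single $p_k$ to expose.

```python
import jax.numpy as jnp

# Toy quadratic objective f(y) = 0.5 y^T H y - b^T y, with gradient g = H y - b.
H = jnp.array([[3.0, 0.5], [0.5, 1.0]])
b = jnp.array([1.0, -2.0])
y = jnp.zeros(2)
g = H @ y - b

# Steepest descent: a straight line, delta(alpha) = alpha * p_k with p_k = -g.
delta_steepest = lambda alpha: -alpha * g

# A damped (Levenberg-Marquardt-style) descent: delta(alpha) = -(H + (1/alpha) I)^{-1} g.
# The direction of the step changes as alpha varies, so no fixed p_k exists.
delta_damped = lambda alpha: -jnp.linalg.solve(H + (1.0 / alpha) * jnp.eye(2), g)
```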

$\delta(\alpha)$ is computed with the step method on any descent (see the abstract base class here). The way this gets passed to the line search is via the y_eval argument of the line search step method: y_eval is the proposed next iterate for a given step-size $\alpha_k$. The actual diff $\delta(\alpha_k)$ can then be computed by subtracting y from y_eval; see the BacktrackingArmijo line search for an example.
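
For concreteness, a minimal sketch of recovering that diff when the parameters are a pytree (this uses plain `jax.tree_util` rather than whatever helper the library uses internally):

```python
import jax.tree_util as jtu

def descent_diff(y, y_eval):
    """delta(alpha_k) = y_eval - y, computed leaf-wise over a pytree of parameters."""
    return jtu.tree_map(lambda new, old: new - old, y_eval, y)
```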

As for the technical details of implementing the Wolfe conditions, there should be no issues here. The classic Wolfe conditions for the current iterate $y_k \in \mathbb{R}^n$, proposed step-size $\alpha_k \in \mathbb{R}$ and descent direction $p_k \in \mathbb{R}^n$ are the sufficient decrease condition

$$f(y_k + \alpha_k p_k) < f(y_k) + c_1 \alpha_k p_k^T \nabla f(y_k)$$

and the curvature condition

$$\nabla f(y_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f(y_k)^T p_k$$

for some constants $c_1, c_2$. We can replace the traditional line-search step $\alpha_k p_k$ with the generalised descent $\delta(\alpha_k)$, giving the sufficient decrease condition

$$f(y_k + \delta(\alpha_k)) < f(y_k) + c_1 \delta(\alpha_k)^T \nabla f(y_k)$$

and the curvature condition

$$\nabla f(y_k + \delta(\alpha_k))^T \delta(\alpha_k) \geq c_2 \nabla f(y_k)^T \delta(\alpha_k),$$

without changing the meaning of either condition. When $\delta(\alpha) = \alpha p_k$ these are equivalent to the classical Wolfe conditions. Different relaxations of the Wolfe conditions should generalise in roughly the same way.
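
As a sketch of checking these generalised conditions (flat-array parameters and the function name/constant defaults are assumptions for illustration, not Optimistix API):

```python
import jax
import jax.numpy as jnp

def generalised_wolfe(f, y, delta, c1=1e-4, c2=0.9):
    """Check the generalised Wolfe conditions for a proposed step delta = delta(alpha_k).

    Sufficient decrease: f(y + delta) <  f(y) + c1 * delta^T grad f(y)
    Curvature:           grad f(y + delta)^T delta >= c2 * grad f(y)^T delta
    """
    grad_y = jax.grad(f)(y)
    grad_eval = jax.grad(f)(y + delta)
    slope = jnp.dot(grad_y, delta)
    sufficient_decrease = f(y + delta) < f(y) + c1 * slope
    curvature = jnp.dot(grad_eval, delta) >= c2 * slope
    return sufficient_decrease & curvature
```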

Does this clear things up?

HeSchatz commented 3 months ago

Thank you for taking the time to write down the general line search problem and the current implementation in such detail. This helps a lot in clarifying the problem I'm currently facing.

First, you are totally right about the Wolfe conditions. I understand now that the descent information is computed by subtracting y from the current evaluation point y_eval, which gives

$\delta(\alpha_k) = \alpha_k p_k $

Evaluating the sufficient decrease condition with $\delta(\alpha_k)$ is straightforward, but I had not noticed that the curvature condition is equivalent whether the gradients on both sides are multiplied by $p_k$ or by $\delta(\alpha_k)$.

The remaining problem is the computation of new trial step sizes $\alpha_{k+1}$ in the zoom phase of the line search algorithm suggested in Nocedal & Wright (2006), Numerical Optimization, 2nd ed. (Algorithms 3.6 and 3.5, respectively).

They propose computing new trial step sizes $\alpha_{k+1}$ by minimizing the cubic function interpolating $\phi(\alpha_{k-1})$, $\phi'(\alpha_{k-1})$, $\phi(\alpha_k)$ and $\phi'(\alpha_k)$. The new step size can then be computed by:

$$\alpha_{k+1} = \alpha_k - (\alpha_k - \alpha_{k-1}) \left[ \frac{\phi'(\alpha_k) + d_2 - d_1}{\phi'(\alpha_k) - \phi'(\alpha_{k - 1}) + 2 d_2} \right]$$

with

$$d_1 = \phi'(\alpha_{k-1}) + \phi'(\alpha_k) - 3 \frac{\phi(\alpha_{k-1}) - \phi(\alpha_k)}{\alpha_{k-1} - \alpha_k}$$

and

$$d_2 = \left[ d_1^2 - \phi'(\alpha_{k-1}) \phi'(\alpha_k) \right]^{1/2}.$$

This requires explicit computation of the derivative

$$\phi'(\alpha) = \nabla f(y_k + \alpha p_k)^T p_k = \nabla f(y_k + \delta(\alpha))^T \delta'(\alpha)$$

of

$$\phi(\alpha) = f(y_k + \alpha p_k) = f(y_k + \delta(\alpha)),$$

which in turn requires the descent direction $p_k$ (or, in the generalised form, the derivative $\delta'(\alpha)$ of the descent curve). At least I could not come up with a way to evaluate the minimizer without doing so. Please let me know if you have any suggestions. Maybe I'm once again misled by the usual notation.

Edit: fixed some typos in equations
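
For reference, the update above transcribes directly into code; the helper below is just that formula, with names of my own choosing rather than anything from an existing implementation:

```python
import jax.numpy as jnp

def cubic_trial_step(alpha_prev, alpha, phi_prev, phi, dphi_prev, dphi):
    """New trial step size from the cubic interpolating phi and phi' at
    alpha_prev and alpha (the formula from Nocedal & Wright quoted above)."""
    d1 = dphi_prev + dphi - 3.0 * (phi_prev - phi) / (alpha_prev - alpha)
    d2 = jnp.sqrt(d1**2 - dphi_prev * dphi)
    return alpha - (alpha - alpha_prev) * (dphi + d2 - d1) / (dphi - dphi_prev + 2.0 * d2)
```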

packquickly commented 1 month ago

Yep, this looks like it may require a more significant rewrite. As you pointed out, the issue is how to get $\delta'(\alpha)$ without access to $\delta$.

The most obvious solution may be to pass the descent's step method and the descent state to each line search, either through some very simple custom type, or by capturing the state in a closure and passing just the scalar function `lambda alpha: descent.step(alpha, descent_state)`. This is a breaking change, since it alters the base class AbstractSearch, but these kinds of line searches should be well supported in Optimistix and I don't see a non-breaking way to support them. @patrick-kidger do you have any strong thoughts on this? If not, I can draft either of these approaches in a PR.
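
To sketch the closure option (purely hypothetical wiring: this pretends `descent.step` returns just the diff as a flat array, which glosses over the real signature), the line search could then obtain both $\delta(\alpha)$ and $\delta'(\alpha)$ from a single JVP:

```python
import jax
import jax.numpy as jnp

def phi_and_dphi(f, y, descent_curve, alpha):
    """phi(alpha) = f(y + delta(alpha)) and phi'(alpha) via the chain rule,
    where `descent_curve` stands in for `lambda a: descent.step(a, descent_state)`
    simplified to return only the flat-array diff delta(alpha)."""
    alpha = jnp.asarray(alpha)
    # One JVP through the descent curve gives delta(alpha) and delta'(alpha).
    delta, ddelta = jax.jvp(descent_curve, (alpha,), (jnp.ones_like(alpha),))
    y_eval = y + delta
    phi = f(y_eval)
    dphi = jnp.dot(jax.grad(f)(y_eval), ddelta)  # grad f(y + delta(alpha))^T delta'(alpha)
    return phi, dphi
```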

@HeSchatz I've already left you in the dark on this for way too long, my apologies. While we determine how much of a rewrite this will need, a clunky option which at least gets the autodiff issue out of your way would be to use finite differences. You should be able to compute and store $\delta(\alpha + h)$ and $\delta(\alpha - h)$ using step rejection and the search state, and then use the central finite-difference approximation. This may also be a case where it makes more sense to use a forward finite difference rather than a central one, avoiding the extra function evaluation at the cost of some accuracy.
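
In code, the two finite-difference options amount to the following (`h` is a small, hypothetical perturbation of the step size; the step-rejection bookkeeping needed to actually obtain the extra $\phi$ evaluations is omitted):

```python
def dphi_central(phi, alpha, h=1e-4):
    """Central difference: two extra evaluations of phi, O(h^2) accurate."""
    return (phi(alpha + h) - phi(alpha - h)) / (2.0 * h)

def dphi_forward(phi, phi_alpha, alpha, h=1e-4):
    """Forward difference reusing the already-known phi(alpha):
    one extra evaluation, O(h) accurate."""
    return (phi(alpha + h) - phi_alpha) / h
```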

patrick-kidger commented 1 month ago

I think I'm happy with a breaking change. We're still early in this project's lifecycle, and this is already an advanced API.