Open · Joshuaalbert opened 2 weeks ago
Hi Joshua, Hi Patrick,
I follow discussions on this repository and also use the optimistix version of LM quite a bit, so I'm interested in this discussion.
For 2.: If I understand correctly, then $\mu$ is an auxiliary parameter used to find the step size. Where I do not follow is:

> This only needs to be done once, and thereafter `step_size` is modified following the normal approach.
Why would a line search at some initial point be helpful for modifying the step size throughout the solve? Wouldn't this merely scale the step size by some value if done once, and then you'd go back to multiplying with high and low cutoffs and essentially end up with a more-or-less identical $\lambda$ after a few iterations?
Maybe I am missing something :) I'd be curious how this improves the convergence rates, and if making the step size more dimensionally correct "matters" to the solver, which only sees floats.
And to your other point - re-using Jacobians would make step-size computations less accurate, right? To what extent can this be counteracted by making the step size selection criteria more stringent?
@Joshuaalbert -- thanks for getting involved!
To address your various points:
So I believe we actually experimented with this exact idea during development. (Certainly it's a standard one that one sees floating around.) I think on balance we elected not to include this -- IIRC it didn't help overall in the average case. But I think if this is important to you then I'd definitely be happy to take that as a PR, just with a default cutoff of `jnp.inf` so as to preserve backward compatibility.
To make explicit what you are suggesting, you are considering multiplying this quantity by `|grad(f)|` (or `|J^T f|` in the least-squares case).
I think something like that is reasonable, although note that to be dimensionally correct I think things may need to change further, as `step_size` is unitless. (It is an arbitrary scalar whose value is entirely up to the search to choose.) What would you suggest in light of that?
(I probably wouldn't introduce the initial search -- as Johanna comments this is unlikely to affect things dramatically and it would impede efficient compilation.)
On reusing the Jacobian evaluations: we actually already do this in the Gauss-Newton case; notice how we form it here in `query` rather than in `step`.
Indeed, for our LM implementation we elected to use the exact form, in which this is computed on every step. If you wanted to implement an approximate form in which this is re-evaluated based on some criterion (even a dynamic one), then you could actually already implement this yourself without any changes to Optimistix! Implement your own descent and/or search. Depending on how this goes, I'd be happy to consider upstreaming this if that was something you wanted.
Hi both, thanks for the replies. @johannahaffner the reason the step-size selection only needs to be performed once is that all it's doing is finding a suitable initial damping factor. This is a damping parameter that would lead to a successful steepest-descent iteration in the asymptotic case. By asymptotic case I mean this: as `lambda -> inf` we have `(A + lambda*I)^-1 v -> lambda^-1 v`. For those coming from SGD, `lambda = 1/learning_rate` in the asymptotic case, so choosing an initial step-size is kind of like choosing a good initial learning rate. The initial selection does two important things: 1) it removes the need for the user to choose a suitable initial damping parameter, and 2) it automatically ensures that the first iteration shows improvement (in most problems). This wastes fewer iterations where the solver needs to find a good initial step-size.
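To make that concrete, here is a minimal sketch of the one-off initialisation, assuming a hypothetical residual function `residuals` and initial point `x0` (neither is Optimistix API); it starts from `mu = |grad|` and halves until the asymptotic steepest-descent step improves the objective:

```python
import jax
import jax.numpy as jnp

def init_mu(residuals, x0, max_halvings=30):
    # One-off search for an initial damping scale mu, in units of [x].
    obj = lambda x: 0.5 * jnp.sum(residuals(x) ** 2)
    f0 = obj(x0)
    grad = jax.grad(obj)(x0)
    unit = grad / jnp.linalg.norm(grad)

    def cond(carry):
        _, f_trial, k = carry
        return (f_trial >= f0) & (k < max_halvings)

    def body(carry):
        mu, _, k = carry
        mu = mu / 2.0  # halve until the trial step improves the objective
        return mu, obj(x0 - mu * unit), k + 1

    mu0 = jnp.linalg.norm(grad)  # start from mu = |grad|
    mu, _, _ = jax.lax.while_loop(cond, body, (mu0, obj(x0 - mu0 * unit), 0))
    return mu
```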
Hi @patrick-kidger, we could make a list of possible improvements and then do them in one fell PR.
`step_size` isn't unitless. We see that in the asymptotic case the steepest descent update is `x -> x - grad / lm_param = x - step_size * grad`, so `step_size` has units `[x]^2/[f]^2`. Note, this is because `grad = J^T.F` for LM. When you take the parametrisation that I suggest, the asymptotic update is `x -> x - mu * grad / |grad|`, and `mu` has units `[x]`. Of course, this only makes sense when all `x_i` components have the same units, but that's just a good reason to use homogeneous spaces to parametrise parameters. This is also why different problems require different initial damping parameters: it's dependent on the scale of the problem. Choosing `lm_param = |grad|/mu` has the nice property that as the stationary point is approached, `|grad|` shrinks, which automatically makes the system more Newton and less sensitive to `mu`. Worth mentioning: in my code I use `lambda(mu) = mu * |grad|` rather than dividing, so `mu` has units `[x]^-1` in that case.
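For illustration, a minimal sketch of a damped step under that parametrisation (a dense-Jacobian toy, not Optimistix code; `J` and `F` stand for the Jacobian and residual vector):

```python
import jax.numpy as jnp

def lm_step(J, F, mu):
    # Damped Gauss-Newton step with lm_param = |J^T F| / mu, so mu has units [x].
    grad = J.T @ F
    lm_param = jnp.linalg.norm(grad) / mu
    A = J.T @ J + lm_param * jnp.eye(J.shape[1])
    return -jnp.linalg.solve(A, grad)
```

As `|grad|` shrinks near a stationary point, `lm_param` shrinks with it, so the step tends towards the pure Gauss-Newton step for any fixed `mu`.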
Another thing I've wanted to do for a while is collect a large number of least-squares problems, of varying difficulty, and then do a grid search over hyper-parameters to choose defaults that lead to fastest convergence on average. WDYT?
thanks for clarifying, @Joshuaalbert :)
> we could make a list of possible improvements and then do them in one fell PR
Multiple smaller PRs are much easier to review :)
> `step_size` isn't unitless
I'd like it to be, though! I think the argument you're making here is really that we should make similar changes elsewhere -- to improve unitlessness -- beyond just the case of LM.
I'm not sure exactly how I would implement dynamic re-evaluation of J, in the case where we don't materialise the Jacobian.
So the loop of a solver is over individual function evaluations. At each step we then decide what to do with that information. C.f. the earlier discussion here:
https://github.com/patrick-kidger/optimistix/issues/89#issuecomment-2447669714
Using this as an example, this already includes a dynamic choice about when to compute the gradient -- in this case, that we are finishing one line search and starting another. You could adjust this logic to match whatever condition you most prefer.
> Another thing I've wanted to do for a while is collect a large number of least-squares problems, of varying difficulty, and then do a grid search over hyper-parameters to choose defaults that lead to fastest convergence on average. WDYT?
I think having some JAX-compatible benchmarks sounds pretty useful to me!
> I'd like it to be, though! I think the argument you're making here is really that we should make similar changes elsewhere -- to improve unitlessness -- beyond just the case of LM.
Just to be clear, this is not possible without some arbitrary imposition of a "default scale". Even using something like `damping = step_size * diag(diag(hessian))`, which would make `step_size` dimensionless, imposes a choice of scale. You still need to find a `step_size` that works for that choice. Note, I find `damping = |grad|/step_size` performs much better than `damping = step_size * diag(diag(hessian))`, as it automatically shrinks as you get closer to a stationary point, which makes the algorithm less sensitive to `step_size`.
All solvers require some knowledge of the search domain. Dimensionlessness is a property of the model, not the solver. So, if you want a dimensionless `step_size`, then you should parametrise your models from the unit cube, using quantiles to impose your prior knowledge. This is actually how my probabilistic programming framework, jaxns, works. The units of the support of `U[0,1]` are the units of probability. E.g. if you have the distance to the Sun as a parameter, this could be a LogNormal variable, with prior knowledge `mu` and `sigma` of the log-distance.
```python
from tensorflow_probability.substrates import jax as tfp
tfpd = tfp.distributions

# Your unconstrained variable in [-inf, inf] (the value here is just an example).
unconstrained_param = 0.3
# Any measure-preserving map to [0, 1] (the CDF of any prob. dist. is fine).
U = tfpd.Normal(0.0, 1.0).cdf(unconstrained_param)
# Prior knowledge, e.g. mu and sigma of the log-distance (example values).
mu, sigma = 0.0, 1.0
# Now apply the quantile of some dist. that encapsulates your prior knowledge.
param = tfpd.LogNormal(mu, sigma).quantile(U)
```
The solver operates on the unconstrained space, which is dimensionless. Note, I have a fast bijection `[-inf, inf) -> [0, 1)` here which doesn't require any transcendentals.
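(For illustration only, and not necessarily the linked map: a purely rational bijection along these lines avoids transcendentals entirely.)

```python
import jax.numpy as jnp

def to_unit_interval(x):
    # Monotonic bijection (-inf, inf) -> (0, 1) using only rational operations.
    return 0.5 * (x / (1.0 + jnp.abs(x)) + 1.0)
```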
However, it's not a problem that any solver is inherently dimensionful. As long as you choose the right problem-specific scale. This is why an automatic determination of an initial step-size in Gauss-Newton would be so helpful to general users. I'm not sure about doing this for every solver, as not all solvers operate in the same way, so you'd need to determine what the effect of scale is on each solver and treat each specifically. Certainly, all variable metric methods that employ a line search already endeavour to find the correct scales by using some form of search. These can all be made more robust by dimensionless parametrisation, and also by ensuring the line search variable is in units of the parameter.
Sorry for the long reply. I love this stuff.
Hi both :)
I think the issue of scaling is specific to regularisation. For instance, in `NewtonDescent` as used by Gauss-Newton and BFGS, `step_size` is obviously unitless. `ClassicalTrustRegion` side-steps the problem of dimensionality by taking the ratio `f_diff / predicted_reduction` and updating the step-size parameter heuristically. The use of the step size as a regularisation parameter in the computation of `y_diff` is then what raises the question of dimensionality.
@johannahaffner I see what you're saying there; however, dimensionality and scale are two different things that are not exactly the same as regularisation (which is about adding some extra info to make ill-posed systems better-posed). There are two things at play: 1) dimensional analysis, which looks at the units of the function and the units of the parameters, and tracks how dimensions ripple through the analysis. This allows one to say things like: in any fixed-point iteration `x -> f(x, theta)`, the units of `f` are the same as those of `x`. E.g. in classic SGD the "learning rate" has units of the inverse Hessian. The simplest result of this is that it often makes things more intuitive to reason in terms of units of parameters. 2) Most algorithms are scale dependent, e.g. linear reparametrisations of the parameters lead to different algorithm behaviour: Gauss-Newton and BFGS (with inexact linesearch), for example. Ideally, we'd like this not to be the case, because then it matters whether I chose to use centimetres or millimetres for length, etc. The best sorts of optimisation algorithms are invariant to monotonic transformations of the objective function and linear transformations of the parameter space. But they are rare.
I take your point that dimensionality and scale are different things!
As long as we subtract $\lambda \mathbb{I}$, I would view this as a regularisation, even though (Tikhonov) regularisation parameters are likely smaller.
The point is that by subtracting `lm_param`, we introduce the step size into the solve for the next step in a way we do not with other descents. So this touches on the effect of this parameter in identifying the next step.
> most algorithms are scale dependent, [...] Gauss-Newton and BFGS (with inexact linesearch) for example.
Don't you mean to say that Gauss-Newton and BFGS are scale-invariant?
Introducing `lm_param`, which makes the steps more gradient-descent-like, actually introduces scale-variance. Scaling the computed trust-region radius by a scalar value such as `|J^T f|` would either lessen or increase that, depending on whether the norm of the Jacobian is less than or greater than one.
Is your goal to figure out what to subtract from the Hessian approximation so as to preserve scale-invariance?
I'm wondering if a little scale variance is not what we want here - and to what extent the robustness of Levenberg-Marquardt depends on being able to interpolate between two different optimisation regimes with different strengths.
Hi @patrick-kidger, as promised I wanted to help improve some of the optimisation methods in optimistix. I'd like to start with the LM implementation.
Trust region acceptance
The implemented approach has two thresholds used for determining any improvement, and sufficient improvement to warrant taking larger steps. The damping parameter is then taken to be `1/step_size` in the damped Newton iteration. There are several points here:
1. Therefore, you should have a third cutoff sensing when `actual_reduction/pred_reduction` is sufficiently greater than one (1.1 is usually fine). In this case, accept but do not make the step more Newton. Basically, only make the iterations more Newton if the gain is within a region around 1 (see the sketch after this list).

2. The damping parameter is added to the (approximate) Hessian, so it must share its units, `[f]/[x]^2`. For normalised least-squares it's `J^T.J`, so it's `[f]^2/[x]^2`, which is consistent because we normalised the equations. Anyways, choosing `lambda = 1/step_size` is not dimensionally correct. Much better is to let `lambda = |grad(f)| / mu` (or `lambda = |J^T.F| / mu` for LM). Note, the units are now correct when `mu` has units `[x]`. The intuition behind this is the asymptotic steepest descent case `x -> x - mu * grad(f) / |grad(f)|`, i.e. a step-size times the gradient unit-vector.

Therefore, you can improve the damping in two ways:

i. Setting `lambda = |grad(f)|/step_size` for minimisation, and `lambda = |J^T.F| / step_size` for LM.

ii. Choosing the initial value of `step_size` can be done by a line search for a value of `mu` that leads to a reduction in the objective. This only needs to be done once, and thereafter `step_size` is modified following the normal approach. A good approach is to start from `mu = |grad(f)|` and halve until `x - mu * grad(f) / |grad(f)|` leads to an objective improvement. You don't need to satisfy any other particular conditions to accept the value of `mu`.
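Here is a minimal sketch of what such a three-cutoff acceptance rule could look like (threshold values and names are illustrative, not the ones Optimistix uses):

```python
import jax.numpy as jnp

def update_step_size(step_size, actual_reduction, pred_reduction,
                     low=0.01, high=0.99, over=1.1, grow=2.0, shrink=0.25):
    ratio = actual_reduction / pred_reduction  # trust-region gain ratio
    accept = ratio > low
    # Only make the iterations more Newton (larger step_size, i.e. smaller
    # damping) when the gain is in a region around 1. If the model badly
    # under-predicts the reduction (ratio > over), accept but keep step_size.
    new_step_size = jnp.where(
        ratio > high,
        jnp.where(ratio > over, step_size, grow * step_size),
        jnp.where(ratio > low, step_size, shrink * step_size),
    )
    return accept, new_step_size
```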
Reusing J/JVP

Multi-step "approximate" LM is easily implemented by first linearising the JVP op around the current parameter and then performing one exact LM step, followed by a number of approximate steps using the same JVP op. In the dense-J case this is really valuable, as you only form the `J` matrix once per `1 + num_approx_steps` steps. It's also still helpful in the sparse case, where `jax.linearize` comes in handy. It is shown in the literature to significantly reduce the amount of computation while only requiring a few more iterations to converge. There are simple criteria to determine when `J` should be recomputed; however, JAX precludes these dynamic decisions. Simplest is a fixed number of approximate steps per exact step.

I didn't have time to attach literature, but hopefully this gets the ball rolling. I also suggest that a suite of simple but difficult benchmarks be written first to assess an improvement to the algorithm.
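A minimal sketch of that reuse pattern, under some assumptions (a hypothetical residual function `residuals`, a dense materialised `J`, and a fixed damping; in the sparse case one would instead keep the `jax.linearize` JVP op and use a matrix-free solver):

```python
import jax
import jax.numpy as jnp

def multi_step_lm(residuals, x, lm_param, num_approx_steps=3):
    # Linearise once around the current point: jvp is the frozen v -> J(x) @ v.
    _, jvp = jax.linearize(residuals, x)
    # Materialise J from the frozen JVP op, one column per basis vector.
    J = jax.vmap(jvp)(jnp.eye(x.shape[0])).T
    A = J.T @ J + lm_param * jnp.eye(x.shape[0])

    # One exact step followed by approximate steps reusing the same J.
    for _ in range(1 + num_approx_steps):
        grad = J.T @ residuals(x)  # residuals re-evaluated, J frozen
        x = x - jnp.linalg.solve(A, grad)
    return x
```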