patrick-kidger / optimistix

Nonlinear optimisation (root-finding, least squares, ...) in JAX+Equinox. https://docs.kidger.site/optimistix/
Apache License 2.0

Classical Newton methods #42

Closed: ataras2 closed this 4 months ago

ataras2 commented 5 months ago

Hi! Thanks for making this package, I'm finding it really helpful for some optimisation problems I'm solving. I was wondering whether there is a Newton method that uses jax.hessian or similar to minimise a function? I've seen BFGS, but for problems in <100 dimensions finding the Hessian is fairly fast, and the locally quadratic convergence is something I'd really like. My own few-line implementation (below) supports this. I'm also wondering whether there are line searches to adapt the Newton step (e.g. backtracking, exact line search). Happy to help merge this in if you are interested.

import jax
import jax.numpy as np

# Assumes newton_loss_func, x, args, max_iter and tol are already defined.
n_iter = 0
new_losses = []
g = jax.grad(lambda x: newton_loss_func(x, args))(x)

while n_iter < max_iter and np.linalg.norm(g) > tol:
    n_iter += 1
    H = jax.hessian(lambda x: newton_loss_func(x, args))(x)
    g = jax.grad(lambda x: newton_loss_func(x, args))(x)
    newton_dir = np.linalg.solve(H, g)

    # crude grid line search - to be upgraded to a proper line search
    t_vals = np.linspace(-2, 0, 101)
    loss_vals = jax.vmap(lambda t: newton_loss_func(x + t * newton_dir, args))(t_vals)
    t = t_vals[np.argmin(loss_vals)]

    x = x + t * newton_dir
    new_losses.append(newton_loss_func(x, args))
patrick-kidger commented 5 months ago

~This should be what you get when doing optx.minimise(..., solver=optx.Newton()). :)~ (EDIT: wrong way around, whoops, see below.)

On line searches: yup, we support a few of these. In particular backtracking.

ataras2 commented 5 months ago

My understanding was that optx.Newton() is a root-finder, not a minimiser (though I guess minimising is just root-finding on the gradient...). I'm looking for something like this - there are almost too many Newton algorithms*

The minimal example below showcases some of my confusion: it works with the BFGS solver but not with Newton (since the function has no root, only a minimum).

import optimistix as optx
import jax.numpy as np

gamma = 10

def fn(x, args):
    return x[0]**2 + gamma*x[1]**2 + 5

# solver = optx.Newton(1e-3, 1e-3)
solver = optx.BFGS(1e-3, 1e-3)
res = optx.minimise(fn, solver, y0=np.array([1.0, 1.0]), max_steps=int(1e4))

print(res.stats["num_steps"], res.value)

Though I now also see that this works:

import jax
solver = optx.Newton(1e-3, 1e-3)
res = optx.root_find(jax.grad(fn), solver, y0=np.array([1.0, 1.0]), max_steps=int(1e4))

print(res.stats["num_steps"], res.value)

*please eliminate three. P.S. I am not a crackpot

packquickly commented 5 months ago

We deliberately kept Newton as a root-finder and not a general-purpose minimisation algorithm. Newton's method for minimisation refers to the minimiser you get by applying Newton's method for root-finding to the gradient of the function, as you've done in your example. This means for non-convex problems Newton is happy to converge to any critical point, and not just minima. We've excluded it to deter people from using it without knowing this, and to encourage use of quasi-Newton methods instead. Quasi-Newton methods converge to non-minimal critical points less frequently.
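
For instance (a quick sketch of that point): applying Newton root-finding to the gradient of x0**2 - x1**2 converges straight to the origin, which is a saddle rather than a minimum.

import jax
import jax.numpy as jnp
import optimistix as optx

def saddle(x, args):
    # The only critical point of x0**2 - x1**2 is the origin: a saddle, not a minimum.
    return x[0]**2 - x[1]**2

solver = optx.Newton(rtol=1e-5, atol=1e-5)
sol = optx.root_find(jax.grad(saddle), solver, y0=jnp.array([1.0, 1.0]))
print(sol.value)  # [0., 0.]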

If you want to apply Newton's method regardless, you can apply it to the gradient as you've done. Unfortunately this doesn't support line searches. You could implement a custom Newton solver with line search quite easily by subclassing AbstractMinimiser and providing the true Hessian in the function info. Take a look at the code for any of the existing minimisers for how to do this, and let me know if you run into trouble.
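
In the meantime, here's a rough sketch of the algorithm itself as a plain JAX loop (deliberately not using optimistix's AbstractMinimiser machinery; newton_backtracking is just a placeholder name): a Newton step from jax.hessian, damped by a backtracking (Armijo) line search.

import jax
import jax.numpy as jnp

def newton_backtracking(f, x, n_steps=20, shrink=0.5, c=1e-4):
    # Plain-JAX sketch: Newton direction from the true Hessian, damped by a
    # backtracking (Armijo) line search. `f` maps an array to a scalar loss.
    for _ in range(n_steps):
        g = jax.grad(f)(x)
        H = jax.hessian(f)(x)
        direction = -jnp.linalg.solve(H, g)
        t = 1.0
        # Backtrack until the Armijo sufficient-decrease condition holds.
        while t > 1e-8 and f(x + t * direction) > f(x) + c * t * (g @ direction):
            t = shrink * t
        x = x + t * direction
    return x

# Example: a quadratic with its minimum at the origin; one full Newton step suffices.
f = lambda x: x[0]**2 + 10 * x[1]**2
print(newton_backtracking(f, jnp.array([1.0, 1.0])))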

ataras2 commented 4 months ago

That all makes sense, thanks for clarifying! Would Newton solvers that exploit structure (perhaps at the cost of some ease of use) be in scope too? Or maybe these are already out there and I'm missing them? E.g. a solver for a minimisation problem whose Hessian is an arrow matrix. In my mind this could involve reformatting the loss function from fn(x) to f(sparse_params, dense_params) and providing a function that can invert the Hessian of the sparse params using its structure.

patrick-kidger commented 4 months ago

I think this should be doable fairly straightforwardly.

First of all, as a similar example for what is already supported, here's a root-find using Newton's method and a diagonal Jacobian. (Which takes the place of the Hessian when doing a root-find rather than a minimisation.)

import lineax as lx
import optimistix as optx
import jax.numpy as jnp
from jaxtyping import Array, Float

def fn(x: Float[Array, "2"], _) -> Float[Array, "2"]:
    x0, x1 = x
    return jnp.stack([x0**2, (x1 - 3)**2])

solver = optx.Newton(rtol=1e-5, atol=1e-5, linear_solver=lx.Diagonal())
y0 = jnp.array([1., 2.])
sol = optx.root_find(fn, solver, y0, tags=frozenset({lx.diagonal_tag}))
print(sol.value)

Here you can see that we use Lineax to handle linear operators and linear solves.
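
For reference, the core Lineax pattern on its own looks something like this (a minimal sketch, independent of optimistix): wrap a matrix in a linear operator, tag it with its structure, and pick a solver that exploits that structure.

import jax.numpy as jnp
import lineax as lx

# A diagonal matrix, tagged as such, solved with the structure-aware Diagonal solver.
matrix = jnp.diag(jnp.array([2.0, 3.0]))
operator = lx.MatrixLinearOperator(matrix, tags=frozenset({lx.diagonal_tag}))
vector = jnp.array([4.0, 9.0])
solution = lx.linear_solve(operator, vector, solver=lx.Diagonal())
print(solution.value)  # [2., 3.]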

So in order to support the example you have in mind, you would need to:

  1. Add an "arrow tag" to indicate the structure of your matrix: arrow_tag = object(). (Akin to the lx.diagonal_tag above.)
  2. Create an Arrow solver that knows how to solve matrices with that structure. (Akin to the lx.Diagonal above; a sketch of the linear algebra involved is below.)

Neither of these requires any changes to Lineax itself, as every built-in solver is defined in the exact same way as one that is defined by a user.
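
To give a flavour of step 2 (just the linear algebra, with hypothetical names, skipping the lx.AbstractLinearSolver plumbing): an arrowhead system can be solved via a Schur complement on the dense corner block, so the cost is linear in the number of "sparse" parameters rather than cubic.

import jax.numpy as jnp

# Hypothetical tag, as in step 1; all Lineax needs is a unique object.
arrow_tag = object()

def solve_arrowhead(diag, border, corner, rhs_top, rhs_corner):
    # Solves [[diag(diag), border], [border, corner]] @ [x, y] = [rhs_top, rhs_corner]
    # via a Schur complement on the scalar corner block: O(n) rather than O(n^3).
    d_inv_rhs = rhs_top / diag
    d_inv_border = border / diag
    schur = corner - border @ d_inv_border
    y = (rhs_corner - border @ d_inv_rhs) / schur
    x = d_inv_rhs - d_inv_border * y
    return x, y

# Example: a 4x4 arrowhead system with a 3-dimensional diagonal block.
diag = jnp.array([2.0, 3.0, 4.0])
border = jnp.array([1.0, 1.0, 1.0])
x, y = solve_arrowhead(diag, border, 5.0, jnp.array([1.0, 1.0, 1.0]), 2.0)

A (hypothetical) Arrow solver would wrap a computation like this behind the lx.AbstractLinearSolver interface, so that optx.Newton(..., linear_solver=ArrowSolver()) can use it just as it uses lx.Diagonal() above.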

Finally, you can of course do your minimisation by doing a root-find on the gradient, as discussed above. (Which is still generally discouraged as @packquickly notes, but that's up to you. :) )

ataras2 commented 4 months ago

Thanks, I'll be sure to have a closer look at Lineax too!