tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

BFGS behaviour change between 0.9 & 0.11? #1034

Closed davidia closed 4 years ago

davidia commented 4 years ago

Hello, I was using the BFGS optimizer successfully with version 0.9. I upgraded to 0.11 and now it gives up after 1 or 2 iterations, whereas it used to solve the problem in a few hundred iterations. I have been through the changes and the only thing I could find was some differences in the hager_zhang algorithm. I have all settings at their defaults apart from providing an inverse Hessian estimate. Unfortunately it's not easy for me to share my problem, but any suggestions would be gratefully received.
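
Roughly, the call looks like this, with objective, x0, and h0 standing in for my actual function, starting point, and inverse Hessian estimate:

import tensorflow_probability as tfp

def value_and_grad(x):
  # bfgs_minimize wants the objective value and its gradient together
  return tfp.math.value_and_gradient(objective, x)

results = tfp.optimizer.bfgs_minimize(
    value_and_grad,
    initial_position=x0,
    initial_inverse_hessian_estimate=h0)  # everything else at its default
print(results.converged, results.num_iterations)  # under 0.11 this now stops after 1 or 2 iterations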

axch commented 4 years ago

I recently reworked that code to make it somewhat more robust to infinite function values, so I may have inadvertently introduced a regression. Does the function you are optimizing ever return +inf, -inf, or nan values? I would expect behavior around those to have changed, though the intent was that it would change for the better.

My changes did have the effect of slightly tweaking the function query pattern in the line search; perhaps it is now falling into a local optimum it used to avoid, or hitting a region of numerically extreme behavior it used to miss?

I don't know that I can say much more than that without more information; is there anything more you can share about your problem and the failure you are experiencing?

davidia commented 4 years ago

Thanks for the insight. I have run 0.9 and 0.11 side by side, and it's clear my function could be better behaved! After recording the invocations of the function, what I see is that at one point my function returns nan for both the value and the gradients; at that point 0.11.0 stops, whereas 0.9.0 plows on and eventually gets to the answer.
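
For reference, a wrapper along these lines is enough to record the invocations (a sketch; objective stands in for my actual function, and the .numpy() calls assume eager mode):

import tensorflow_probability as tfp

call_log = []

def logged_value_and_grad(x):
  value, grad = tfp.math.value_and_gradient(objective, x)
  # Keep every point the optimizer queried, plus what it saw there
  call_log.append((x.numpy(), value.numpy(), grad.numpy()))
  return value, grad

# Pass logged_value_and_grad to bfgs_minimize under 0.9 and 0.11 and diff the logs.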

What's also clear is that the initial step size in the line search is way too big for my function. Can that be tuned, or is it just a case of changing the function somehow?

axch commented 4 years ago

Unfortunately, neither the 0.9 nor the 0.11 APIs allow one to adjust the parameters of the line search, though that's an oversight. File a separate issue for it? It should just be a matter of plumbing through a dict of kwargs, so if you were inclined to just do it, we would welcome a PR as well.

As to the nan value: semantically, I would argue that if your function returns a nan, there isn't a well-defined minimum, because the nan value stands for "this computation is so inaccurate it could be literally anything, including smaller than all other values this function takes". In this sense, the 0.9 behavior of ignoring the nan and continuing looks to me like an error.

That said, what can you do about it? If, in your case, you want a minimum among the non-nan values, you could try masking the nan:

import tensorflow as tf

# Might also need to play tricks with the gradient in order to mask nans there
def new_f(*args, **kwargs):
  raw_answer = f(*args, **kwargs)
  # Replace nans with a huge finite value so the optimizer treats those points as bad
  return tf.where(tf.math.is_nan(raw_answer), 1e50, raw_answer)

(assuming you know that all finite values of f are smaller than 1e50).
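
The gradient trick alluded to in the comment above could look something like this; an untested sketch that assumes f takes a single tensor argument and a dtype, such as float64, in which 1e50 is representable:

import tensorflow as tf

@tf.custom_gradient
def masked_f(x):
  with tf.GradientTape() as tape:
    tape.watch(x)
    value = f(x)
  safe_value = tf.where(tf.math.is_nan(value), tf.constant(1e50, x.dtype), value)

  def grad(upstream):
    g = tape.gradient(value, x)
    # Zero out nan gradient components so the line search isn't poisoned by them
    return upstream * tf.where(tf.math.is_nan(g), tf.zeros_like(g), g)

  return safe_value, grad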

You can also artificially re-scale your function so that the initial steps the line search tries are small enough

# Shrinks the steps BFGS takes in f's original coordinates by a factor of 100
def scaled_f(x):
  return f(x / 100)

though I don't know what that will do to the rest of your optimization landscape.
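
If you do rescale, note that the answer comes back in the rescaled coordinates, so you would map it back afterwards; a sketch, with x0 standing in for your starting point:

results = tfp.optimizer.bfgs_minimize(
    lambda x: tfp.math.value_and_gradient(scaled_f, x),
    initial_position=100. * x0)     # scale the starting point the other way
x_min = results.position / 100.     # minimizer of the original f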

Finally, the high-achiever approach would be to re-code f to compute accurate values instead of nan, though that can be an unbounded amount of work. Might have knock-on benefits though -- if f(3) = nan, then f(2.95) is probably pretty inaccurate, and it might be worth your while to fix that.
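
For instance (a made-up example, nothing to do with your actual f), a naive logistic term overflows into nan where the numerically careful form stays finite:

import tensorflow as tf

x = tf.constant([0., 50., 800.], dtype=tf.float64)
naive = tf.exp(x) / (1. + tf.exp(x))  # exp(800.) overflows to inf, so this yields inf/inf = nan
stable = tf.math.sigmoid(x)           # finite and accurate everywhere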

Oh, and if your nans come from invalid parameters (like a standard deviation < 0, say), you could try optimizing in unconstrained space:

# If f requires x > 0, optimize over an unconstrained y and map it through softplus
def unconstrained_f(y):
  return f(tf.math.softplus(y))

results = tfp.optimizer.bfgs_minimize(
    lambda y: tfp.math.value_and_gradient(unconstrained_f, y),
    initial_position=tfp.math.softplus_inverse(x0))  # x0: a feasible (positive) start
minimum = tf.math.softplus(results.position)  # map the answer back to x > 0

Again, don't know what that will do to your landscape, local optima, etc, but it might be something to try.

davidia commented 4 years ago

Yes, totally agreed about the nan, and thanks for all the suggestions. I have an exp in there, so it's quite easy to get a nan if you take too big a step; the scaling trick works well and it now converges quickly with no nans. I don't know if you have tested against scipy, but your implementation works far better, at least for my problem. Unfortunately I'm not in a position to do a PR at the moment, my apologies. Thanks again for your help and the great work.

axch commented 4 years ago

Sounds like your issue is resolved? I pulled out the request to control the line search as #1043; other than that, I think we agree that the behavior in 0.11 is basically working as intended, so I'm going to close this. Please reopen (or start a relevant new issue) if a problem remains.