leeping / forcebalance

Systematic force field optimization.

Improved error treatment in optimization. #33

Open leeping opened 10 years ago

leeping commented 10 years ago

Please provide input if you are interested.

When ForceBalance encounters a bad optimization step due to a noisy objective function and gradient (both of which contain statistically noisy property estimates), it rejects the optimization step, recalculates the gradient and Hessian, and takes a reduced-size step.

However, the gradient and Hessian from the repeated step are just as noisy as those from the initial step. Also, the noise term remains constant as the objective function converges, so towards the end of the optimization most of the steps are bad. Many of my optimizations are terminated by hand when they take too many bad steps, or I restart the optimization with more MD steps - neither solution is optimal.

We should be able to leverage the information from multiple calculations to improve the optimization step, since they were carried out at the same values of the parameters. The simplest thing to do is to average the objective function, gradient, and Hessian values at the "revisited" parameter values, but I doubt this is strictly correct.

For example, a simple calculation shows that when the property has statistical noise (X + Δ), the objective function (X + Δ - E)^2 is on average larger than the noiseless function by an amount <Δ^2>. This is obvious when the true property average agrees with experiment exactly - the objective function should be zero but it will never be zero.
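To make the bias explicit, write a = X - E and expand the square; with <Δ> = 0 the cross term vanishes:

```latex
\left\langle (X + \Delta - E)^{2} \right\rangle
  = a^{2} + 2a\,\langle \Delta \rangle + \langle \Delta^{2} \rangle
  = (X - E)^{2} + \langle \Delta^{2} \rangle
```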

When the property does not agree with experiment exactly, the objective function might be higher or lower than the true value, but on average it is higher. The variance of the objective function is something like this:

<Δ^4 + 6Δ^2(X - E)^2>

The standard error of the property Δ should probably not be used to correct our objective function because it cannot be precisely evaluated; estimates of standard errors tend to be less precise than estimates of means, so it would worsen our estimate rather than improve it.

The main point is that by averaging multiple objective function values, we don't get the same objective function that would have resulted from a single calculation with proportionally longer trajectories: averaging reduces the variance of the noise but not the positive bias <Δ^2>, whereas longer trajectories shrink <Δ^2> itself, so the latter would give a lower value. But we can't compare this lower value with the next optimization step because it will have the full noise again. In fact, the "averaged objective function" from multiple steps is probably the better comparison value for future optimization steps.
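A toy numerical check of this point (my own illustration, not ForceBalance code): averaging K repeated evaluations shrinks the variance but leaves the bias <Δ^2> intact, while one K-times-longer trajectory shrinks the bias itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X, E = 1.0, 1.0        # true property agrees with experiment exactly
sigma = 0.1            # statistical noise of a single trajectory
n_trials, K = 200_000, 4

# Average of K separate noisy objective evaluations at the same parameters:
# the bias <Delta^2> = sigma^2 survives the averaging.
noisy = (X + rng.normal(0.0, sigma, size=(n_trials, K)) - E) ** 2
print(noisy.mean(axis=1).mean())   # ~ sigma^2 = 0.01

# One evaluation with a K-times-longer trajectory: the noise shrinks to
# sigma/sqrt(K), so the bias drops to sigma^2 / K.
longer = (X + rng.normal(0.0, sigma / np.sqrt(K), size=n_trials) - E) ** 2
print(longer.mean())               # ~ sigma^2 / K = 0.0025
```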

What about the gradient and Hessian? Simple calculations assuming <Δ>=0 indicate that averaging the gradient over multiple calculations indeed converges to the true gradient.

More onerous are the Hessian diagonal elements. They are on average increased by an amount <Δ(∇i)^2> (i.e. the squared standard error in the ith component of the property gradient). This means that on average we are taking smaller optimization steps than we should, and this bias would persist through averaging multiple Hessians. Taking too small an optimization step is a bad thing, because smaller steps are more likely to fall within the statistical noise.
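In the least-squares (Gauss-Newton) approximation this is easy to see; up to the prefactor convention, the diagonal element picks up the squared gradient noise:

```latex
\left\langle H_{ii} \right\rangle
  \approx 2 \left\langle \left( \partial_{i} X + \Delta_{\nabla i} \right)^{2} \right\rangle
  = 2 \left( \partial_{i} X \right)^{2} + 2 \left\langle \Delta_{\nabla i}^{2} \right\rangle
```

An inflated diagonal makes the (approximate) Newton step shorter along every parameter direction.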

So what should we do here? Possible solutions include:

1) Improving estimates of the Hessian through averaging should be done at the property calculation (i.e. "target") level, where the quantity being squared and summed is accessible - so that the Hessian diagonal elements are not inadvertently increased by the statistical noise.

2) Improving estimates of the objective function through averaging should be done at the optimization level, because they would incorporate the positive bias which subsequent optimization steps would share.

3) Each time the optimizer takes a step back and recalculates the objective function, subsequent evaluations of the objective function should increase the number of MD steps by a proportional factor (a minimal sketch of this control flow follows below). This would remove the requirement for (2), so all of the "averaging" calculations may be done at the target level. The downside is that the optimization time would blow up. We might recover some efficiency by reducing bias in the Hessian diagonal elements but still...
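Here is a minimal sketch of the reject-and-lengthen control flow described in (3). All names here (`evaluate`, `take_step`) are hypothetical placeholders, not the actual ForceBalance internals:

```python
def optimize(evaluate, take_step, params, n_md_steps=512_000, max_iter=50):
    """Sketch of option (3): on a rejected step, keep the old parameters and
    double the MD trajectory length for every subsequent evaluation.
    evaluate(params, n_md_steps) -> (objective, gradient, hessian)."""
    obj, grad, hess = evaluate(params, n_md_steps)
    for _ in range(max_iter):
        trial = take_step(params, grad, hess)
        trial_obj, trial_grad, trial_hess = evaluate(trial, n_md_steps)
        if trial_obj < obj:
            # Good step: accept the new parameters.
            params, obj, grad, hess = trial, trial_obj, trial_grad, trial_hess
        else:
            # Bad step, likely within the noise: reject it and double the
            # simulation length so the next comparison is less noisy.
            n_md_steps *= 2
            obj, grad, hess = evaluate(params, n_md_steps)
    return params
```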

One question: when I repeat the liquid properties calculation for the same parameters after rejecting an optimization step, do I need to rerun MBAR on the "doubled" collection of energies, or can I just take the arithmetic mean of the two estimates? I don't think MBAR scales favorably to a very large number of snapshots (I haven't run timings; just an impression from experience).

Let me know what you think. Thanks!

kyleabeauchamp commented 10 years ago

So is there some "theory" of optimizing noisy functions that we can defer to?

Regarding MBar, how many snapshots are we talking about? It's very possible that we could optimize MBar further. I've made some attempts at this, but never finished.

kyleabeauchamp commented 10 years ago

This may take me a couple of days to process, as I'm not super familiar with the code. IMHO Robert might have useful opinions here, as he's thought about "exploration" vs "exploitation" some.

leeping commented 10 years ago

Hi Kyle,

Thanks!

There must be a theory of noisy gradient-based optimization out there; a Google search turns up a number of things. However, there is also a chance that the best solution for ForceBalance would be related to some specifics of force field optimization.

The first Google hit, "Fast Probabilistic Optimization from Noisy Gradients", looks promising, but it appears to learn the Hessian from gradient information alone - whereas we actually have approximate Hessian information because this is a least-squares problem (though even for noiseless objective functions it is not the true Hessian). Also, the derivations were a bit dense for me; maybe it was just the symbols.

The second Google hit, "Variance Reduction for Stochastic Gradient Optimization", looks like it places a precomputed, hand-selected prior on the noisy gradients - which is probably not what we want.

The MBAR calculation is 50 thermodynamic state points by ~50,000 snapshots.

kyleabeauchamp commented 10 years ago

So if you think MBar is really rate limiting, then I definitely can revisit my work on improving its speed.

leeping commented 10 years ago

I am now running some calculations which print out the MBAR weights. Since the "answer" from MBAR is the set of weights placed in front of each simulation (when estimating the property at a given state point), a variation of a few percent in the weights doesn't matter much.

The noise that I'm talking about here is on the order of 50-100% (towards the end of an optimization) - if MBAR adds a few percent that isn't a big issue.
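For concreteness, the weights mentioned above enter the property estimate as a simple weighted average over snapshots, so a few-percent change in the weights propagates roughly linearly into the estimate. A schematic in numpy (the weight vector itself would come from an MBAR solver such as pymbar):

```python
import numpy as np

def reweighted_property(A_n, w_n):
    """Estimate a property at a target state from pooled snapshots.
    A_n: property evaluated at each snapshot;
    w_n: MBAR weights for the target state."""
    w_n = np.asarray(w_n, dtype=float)
    w_n /= w_n.sum()    # renormalize the weights to 1.0
    return np.dot(w_n, np.asarray(A_n, dtype=float))
```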

leeping commented 10 years ago

MBAR is not rate limiting at the moment. It takes a half hour on my workstation, as opposed to the batch of simulations which take up to 24 hours running in parallel. If it starts to take 8 hours I'll let you know, but that depends on (1) a larger MBAR calculation being needed, and (2) truly unfavorable scaling. :)

leeping commented 10 years ago

On a side note, a faster MBAR could be very helpful for a related project where the calculation is closer to 100 thermodynamic state points x 1 million snapshots - so I would definitely use the improved code. We can talk about that in another channel if you're interested.

kyleabeauchamp commented 10 years ago

It's on my "eventual" to-do list. When that becomes a high priority, harass me on the MBar GitHub issue tracker...

leeping commented 10 years ago

Cool. Thanks :)

leeping commented 10 years ago

One more note - when I visited the UK in 2012 I asked about noisy optimization.

At that point I remember reading about a number of stochastic optimization methods, but they were gradient-free and not very effective for parameterizing force fields.

VijayPande commented 10 years ago

I can think of stochastic methods that handle noise but not gradient-based minimization.

Naively, if I had to come up with a method here, I'd assume the way to handle this is to do some sort of smoothing, either by some kernel methods or by curve fitting. It would be fun to chat more about this.
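As one concrete (purely hypothetical) version of the smoothing idea: pool every noisy evaluation seen so far, including rejected steps, and kernel-average them in parameter space, e.g. Nadaraya-Watson regression:

```python
import numpy as np

def smoothed_objective(theta, past_params, past_objs, bandwidth=0.1):
    """Gaussian-kernel (Nadaraya-Watson) estimate of the objective at theta.
    past_params: (n, d) array of previously visited parameter vectors;
    past_objs:   (n,)  array of the corresponding noisy objective values."""
    d2 = np.sum((np.asarray(past_params) - np.asarray(theta)) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)
    return np.dot(w, np.asarray(past_objs)) / np.sum(w)
```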

leeping commented 10 years ago

Yes, definitely. I thought about using stochastic gradient-free minimization when we started working on the water model, but it didn't fit the bill (we have expensive objective functions and relatively inexpensive gradients / approximate Hessians.)

peastman commented 10 years ago

You should talk to Sherm about this. I bet he'll have insight on the best way to handle it.

leeping commented 10 years ago

Thanks, it's been a while since I've had a chat with Sherm. :)

leeping commented 10 years ago

I had conversations with a few folks today, and it seems this doesn't fall into the usual class of convex optimization problems. I might look into machine learning methods and things like "Adagrad", MRFs, or "contrastive divergence" - but for now I'll implement my hack.

For my initial try at implementing this, I'll have MBAR recalculate the weights for the "doubled" data set. I'll let you know if things become too costly. Thanks.

leeping commented 10 years ago

I implemented the change today. It seems to work well for the liquid bromine test case; in the case of a bad step, the simulation is repeated, and future simulation lengths are set to produce an amount of data equal to the total (i.e. doubled).

Starting with a simulation length of 0.1 ps (ridiculously small), the optimization eventually increases the simulation length to 2.0 ns, and it converges the condensed phase properties to a very tight threshold.

#============================================================#
#|  This objective function evaluation combines 2 datasets  |#
#|  Increasing simulation length: 512000 -> 1024000 steps   |#
#============================================================#
MBAR Results for Phase Point (298.15, 1.0, 'atm'), Box, Contributions:
[ 1.]
InfoContent:  20482.00 snapshots (100.00 %)
Weights have been renormalized to 1.0
Physical quantity Density uses denominator =  30.0000
Weights have been renormalized to 1.0
Physical quantity H_vap uses denominator =  0.3000
#==========================================================================================#
#|                            LiquidBromine Density (kg m^-3)                             |#
#|  Temperature  Pressure  Reference  Calculated +- Stdev     Delta    Weight    Term     |#
#==========================================================================================#
      298.15      1.0 atm  3102.800     3103.186 +- 2.971     0.386   1.00000   0.00017
--------------------------------------------------------------------------------------------
#========================================================#
#|   Density objective function:  0.000, Derivative:    |#
#========================================================#
   0 [ -9.4053e-01 ] : VDWS:opls_730
   1 [  1.1346e+00 ] : VDWT:opls_730
----------------------------------------------------------
#==========================================================================================#
#|                  LiquidBromine Enthalpy of Vaporization (kJ mol^-1)                    |#
#|  Temperature  Pressure  Reference  Calculated +- Stdev     Delta    Weight    Term     |#
#==========================================================================================#
      298.15      1.0 atm    29.960       29.975 +- 0.432     0.015   1.00000   0.00250
--------------------------------------------------------------------------------------------
#======================================================================#
#|  Enthalpy of Vaporization objective function:  0.003, Derivative:  |#
#======================================================================#
   0 [  5.0221e-01 ] : VDWS:opls_730
   1 [  1.6020e+01 ] : VDWT:opls_730
------------------------------------------------------------------------

We are still throwing away the information from the "bad" steps - though, as Vijay suggested, those steps do give us information about the objective function / derivatives in parameter space, just not at the current point. It would be intriguing to see how future method developments can incorporate this information.

tmartine68 commented 10 years ago

The most relevant thing I can think of is the paper by Ceperley where he deals with noisy energies from QMC. But he is doing Monte Carlo with the QMC, so it ends up meaning that the acceptance ratio should be modified to account for a temperature contribution from the numerical noise (assuming the numerical noise is Gaussian). See Ceperley and Dewing, J. Chem. Phys. 110, 9812 (1999). -Todd
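For reference, the penalty method from that paper modifies the Metropolis acceptance probability for Gaussian noise of variance σ² on the energy difference ΔE, roughly as:

```latex
A = \min\left[\, 1,\; \exp\!\left( -\beta \,\Delta E \;-\; \tfrac{1}{2} \beta^{2} \sigma^{2} \right) \right]
```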

leeping commented 10 years ago

Thanks - I haven't thought about gradient-based optimizations as having a temperature before, but that could be a useful interpretation especially when there's noise.

leeping commented 10 years ago

Here are results for an example problem - equation of state for a 3-point water model. The starting model is TIP3P. Every time the objective function jumps up, the optimization returns to the previous step, and all subsequent simulations are doubled in length.

As you can see, the longer simulations become necessary as we need to converge the objective function further.

Eventually we reach agreement to within 0.3% across all temperatures, and the total simulation length (summing over optimization cycles) is 20.3 ns for each thermodynamic state point (so about 1 microsecond total for all state points).

[Figures noisy_page_1 - noisy_page_3: optimization results for the 3-point water model example]