stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Generalize auto-tuning in ADVI to grid search function, for auto-tuning in L-BFGS #1780

Open dustinvtran opened 8 years ago

dustinvtran commented 8 years ago

(There are a number of things I plan to generalize in the code base as I start working on the GMO implementation in Stan. I'm documenting the issues here so I don't forget.)

See title. We can use the same grid search to optimize hyperparameters for many things in Stan. The easiest extension at the moment is to make the grid search a stand-alone function, for auto-tuning the initial step-size parameter in L-BFGS.
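Roughly, as a stand-alone function it might look like the sketch below (all names here are hypothetical, not existing Stan API; the objective is whatever the caller wants to maximize, e.g. ELBO increase for ADVI or objective decrease for L-BFGS):

```cpp
#include <functional>
#include <limits>
#include <vector>

// Hypothetical stand-alone grid search; none of these names exist in Stan.
// evaluate(x) runs a short trial with hyperparameter value x and returns a
// score to maximize. Assumes the grid is non-empty.
inline double grid_search(const std::vector<double>& grid,
                          const std::function<double(double)>& evaluate) {
  double best_x = grid.front();
  double best_score = -std::numeric_limits<double>::infinity();
  for (double x : grid) {
    double score = evaluate(x);
    if (score > best_score) {
      best_score = score;
      best_x = x;
    }
  }
  return best_x;
}
```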

avehtari commented 8 years ago

Is this grid search autotuning described somewhere? Maybe Bayesian optimisation could be used to improve autotuning?

dustinvtran commented 8 years ago

Just checked CmdStan and the Stan manual. It's mentioned briefly in both, but not in much detail. The idea is very simple: there is a grid of possible values for the step-size hyperparameter; try each one for a fixed number of iterations; choose whichever leads to the fastest increase in the ELBO over those iterations. Instead of trying all of them and deciding after the fact, it walks through the step-size choices monotonically, so it can decide early to stop checking the remaining grid values and run with a particular setting.
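A minimal sketch of that early-stopping walk (hypothetical names again; `run_trial` stands in for running the optimizer for the fixed number of iterations and reporting the ELBO it reaches):

```cpp
#include <functional>
#include <limits>
#include <vector>

// Sketch of the monotone walk with early stopping; not actual Stan code.
// step_sizes is ordered (say, largest to smallest); run_trial(s) runs a
// fixed number of iterations with step size s and returns the ELBO reached.
inline double tune_step_size(const std::vector<double>& step_sizes,
                             const std::function<double(double)>& run_trial) {
  double best_step = step_sizes.front();
  double best_elbo = -std::numeric_limits<double>::infinity();
  for (double s : step_sizes) {
    double elbo = run_trial(s);
    if (elbo <= best_elbo)
      break;  // ELBO stopped improving along the ordered grid: stop early
    best_elbo = elbo;
    best_step = s;
  }
  return best_step;
}
```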

re: your second question, absolutely! That's the end goal of refactoring all these heuristics for hyperparameter selection: so we can employ meta-algorithms strictly for that task.

bob-carpenter commented 8 years ago

Are these problems monotonic? And how can it set step sizes when it's still out in the tail? The struggle we have with NUTS is running long enough to get to the typical set, where the real adaptation can kick in. The danger is that the tail regions can be very flat compared to the typical set, so step sizes set too early out there end up overly large.

Sorry if this is all obvious!


dustinvtran commented 8 years ago

Stochastic optimization faces the same problem, but to a lesser extent. Recall that the learning rate multiplies the gradient and so determines the magnitude of each parameter update along the gradient's direction. The learning rate is a step size that decreases over time (with the iteration count) and, barring numerical issues, guarantees convergence to a local optimum.
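For reference, the classical Robbins-Monro conditions behind that guarantee (this is the standard result, paraphrased here rather than quoted from Stan's docs) are, for updates

$$\theta_{t+1} = \theta_t + \rho_t \, \hat{\nabla}_\theta \mathcal{L}(\theta_t), \qquad \sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty,$$

satisfied, for example, by $\rho_t = \eta / t$: the first condition lets the iterates travel arbitrarily far (out of the tails), while the second makes the gradient noise, and hence the bouncing, die out.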

In practice, the algorithm still needs a well-specified learning rate while it's off in the tail regions, but only there. Once it's in the vicinity of the bulk of the posterior, it will still bounce around a lot, in proportion to the magnitude of the learning rate, but that magnitude eventually decreases to 0. So a misspecification of the constant scale factor won't be too big an issue. Ideally, we would hyperoptimize the gradient descent procedure so that we learn the optimal magnitude for each iteration's update. That's impractical, of course, because it requires running a mini-optimization per iteration. So for now we settled on a single mini-optimization at the beginning. A related idea is the probabilistic line searches of Maren Mahsereci and Philipp Hennig.
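To make the division of labor concrete, here is a toy update (not Stan's actual rule): the grid-searched constant eta only scales a schedule that decays to 0 on its own.

```cpp
#include <vector>

// Toy ascent step, not Stan's update rule: the grid-searched constant eta
// only scales a schedule rho_t = eta / (t + 1) that decays on its own.
inline void sgd_step(std::vector<double>& theta,
                     const std::vector<double>& grad,
                     double eta, int t) {
  double rho = eta / (t + 1.0);  // rho -> 0 as the iteration count t grows
  for (std::size_t i = 0; i < theta.size(); ++i)
    theta[i] += rho * grad[i];  // gradient ascent on the ELBO
}
```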

We've thought about related ideas for assessing convergence, and for getting to the optimum more quickly without waiting for all that bouncing around to reach 0 variance. It's still in the preliminary stages of research, though.

syclik commented 8 years ago

@dustinvtran, is there a path forward with this issue? If not, could we close it for now?