From Marcus on stan-users in response to a model and question from David C. Cohen:
The optimal solution (from nnls) is very sparse, i.e., all but maybe a dozen coefficients are zero. This solution is going to be very hard (if not impossible) for Stan to reach using the parameterization (using lower=0) that you used in the basic model. This is because in the unconstrained domain, a parameter effectively needs to go to negative infinity for the constrained value to be zero. Worse, it is definitely possible for the optimizer to converge to a bad solution due to numerical issues. This happens when the unconstrained parameter x becomes sufficiently negative that exp(x) underflows and is numerically zero. The problem is that the derivative with respect to x then becomes zero, and this parameter will never change again. So if the optimizer overshoots for a parameter, it can get stuck in a bad place. Worse, this can't be easily diagnosed because its gradient will appear to be zero!
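For concreteness, here is a hypothetical fragment of the kind of declaration being discussed (the names K and beta are made up for illustration, not taken from the original model):

```stan
data {
  int<lower=1> K;
}
parameters {
  // With <lower=0>, Stan optimizes the unconstrained value u = log(beta).
  // A coefficient can only reach 0 as u -> -infinity, and once exp(u)
  // underflows to 0 the gradient with respect to u is exactly 0, so that
  // coefficient never moves again.
  vector<lower=0>[K] beta;
}
```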
To summarize for users: if you are optimizing and expect your solution to end up on a constraint boundary, do not use Stan's "lower=" or "upper=" constraints. They will almost certainly result in failures, and it is likely that there will be no obvious indication that the optimization has failed.
Using a squared or fabs parameterization avoids this issue for the most part. You can't use init=0 in this case because the gradients are all zero there, but otherwise it will work better. However, for it to have a chance of working, you need to normalize your data a bit. Right now the scale of the data is huge, which blows up the gradients and makes it difficult for the optimizer to make initial progress. You should normalize your data to have a standard deviation of 1. Note that this can even be done in the transformed data block of the Stan model to make it easier.
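A minimal sketch of what this could look like, assuming a simple least-squares setup (the variable names and the exact normalization are illustrative, not from the original model):

```stan
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
transformed data {
  // Rescale to standard deviation 1 so the initial gradients are on a sane scale.
  vector[N] y_std = y / sd(y);
  matrix[N, K] X_std = X / sd(to_vector(X));
}
parameters {
  vector[K] beta_raw;  // unconstrained; do not initialize at exactly 0
}
transformed parameters {
  vector[K] beta = square(beta_raw);  // >= 0 by construction; fabs(beta_raw) also works
}
model {
  // Least-squares objective: maximizing this is equivalent to minimizing
  // the sum of squared residuals.
  y_std ~ normal(X_std * beta, 1);
}
```

Because beta reaches zero at beta_raw = 0 rather than at negative infinity, the optimizer can actually land on the sparse solution.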
This model is sufficiently large that you'll want to use LBFGS instead of BFGS. On my machine LBFGS is about 40 times faster for your data. As an added bonus, because LBFGS is often better at adapting to curvature changes, it actually gets to a better optimum for me.
Also, be aware of the convergence criteria. They are set to reasonable default values, but they may well need to be adjusted for different problems. If the optimizer is stopping but not at the right solution, consider adjusting (or turning off) some of the convergence tests; e.g., for your data and model I found turning off the relative convergence tests to be useful. See also the point above about normalizing your data.
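If you are running from CmdStan, both the algorithm choice and the stopping rules are set on the command line. The sketch below uses flag names from recent CmdStan releases and placeholder file names, so check the documentation for your version before copying it:

```
# Sketch of a CmdStan call (flag names per recent CmdStan releases; file
# names are placeholders).  algorithm=lbfgs selects L-BFGS; setting the
# relative tolerances to 0 effectively disables the relative convergence
# tests; iter caps the number of iterations.
./my_model optimize iter=5000 \
    algorithm=lbfgs tol_rel_obj=0 tol_rel_grad=0 \
    data file=my_data.data.R \
    output file=optimum.csv
```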
The model could be made a bit faster by using squared_distance() instead of dot_self(). (Not a huge speedup, but it will help a little bit.)
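For example (variable names are placeholders, not from the original model), the two expressions below compute the same sum of squared residuals; squared_distance() does it without first building the residual vector:

```stan
// Both compute the sum of squared residuals; names are placeholders.
real ssr_a = dot_self(y - X * beta);         // forms the residual vector, then dots it with itself
real ssr_b = squared_distance(y, X * beta);  // computes the same value directly
```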
Play around a bit with normalizing your data, the squared parameterization, and LBFGS, and let us know how it goes.