
Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/

FAQ: choosing optimizer #1641

Open josef-pkt opened 10 years ago

josef-pkt commented 10 years ago

Note (update): statsmodels now supports all scipy optimizers through the scipy.optimize.minimize function. The list below has not been updated. Several scipy optimizers (especially bfgs) have changed behavior across scipy versions, and new optimizers have been added to scipy.

Which optimizer should I choose, especially when I run into problems?

Some notes on the optimizers that we can choose for non-linear optimization in models. See also #1649 "FAQ: examples with optimization failures". Related: parameter transformation to make the optimization problem better behaved, see #1131.

Our current defaults work well in cases that are not too badly behaved:

With the exception of basinhopping, all optimizers are local optimizers and might converge only to a local optimum.
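
A minimal sketch with synthetic data (not from any statsmodels test case) of how a solver is selected, through the method argument of fit; "newton" is the default for the discrete models:

import numpy as np
import statsmodels.api as sm

# synthetic logit data
rng = np.random.default_rng(12345)
exog = sm.add_constant(rng.normal(size=(500, 3)))
linpred = exog @ np.array([0.5, 1.0, -1.0, 0.3])
endog = (linpred + rng.logistic(size=500) > 0).astype(float)

model = sm.Logit(endog, exog)
res_newton = model.fit(method="newton")   # default for discrete models
res_bfgs = model.fit(method="bfgs")       # scipy bfgs
print(res_newton.params)
print(res_bfgs.params)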

Newton

"newton" is a simple solver implemented in statsmodels. It now includes a ridge factor to reduce problems with a near-singular Hessian. It works well for discrete models, needs few iterations, and uses the Hessian. Problems: it breaks if the Hessian is not positive definite, and since it has no line search it might break if the gradient is very large at the starting values.

Bfgs

Works well and fast if we are close enough to the minimum, and finds an optimum with a very small gradient. Problems: it can diverge or fail to converge if the gradient is large or the shape of the function is not "nice" during the optimization.
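
A sketch (again with synthetic data) of a bfgs fit together with a check of the gradient at the reported solution; the exact keys available in mle_retvals depend on the solver and version:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
exog = sm.add_constant(rng.normal(size=(300, 2)))
endog = rng.poisson(np.exp(exog @ np.array([0.2, 0.5, -0.3])))

model = sm.Poisson(endog, exog)
res = model.fit(method="bfgs", maxiter=200)
print(res.mle_retvals.get("converged"))
print(np.abs(model.score(res.params)).max())  # should be close to zero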

Nelder-Mead, nm

The most robust optimizer in scipy; it converges even from not very good starting values. Problems: it takes a large number of iterations, can get stuck in regions that are almost local minima, or stops just short of the minimum. It doesn't use the gradient for the convergence check.
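
A sketch of the typical pattern with nm on synthetic data: give it a generous iteration limit and check the score afterwards, since its convergence check ignores the gradient:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
exog = sm.add_constant(rng.normal(size=(300, 3)))
endog = (exog @ np.array([0.3, 1.0, -0.5, 0.2]) + rng.logistic(size=300) > 0).astype(float)

model = sm.Logit(endog, exog)
res_nm = model.fit(method="nm", maxiter=5000, disp=0)
# nm can stop with a gradient that is still noticeably away from zero
print(np.abs(model.score(res_nm.params)).max())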

slsqp

Used in fit_regularized; a solver for constrained problems that works well on quadratic problems. Problems: possibly platform issues.
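
A sketch of fit_regularized on a synthetic Logit; as described above, the default "l1" method is the path that uses slsqp, and alpha is the penalty weight:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
exog = sm.add_constant(rng.normal(size=(300, 5)))
endog = (exog[:, 1] - exog[:, 2] + rng.logistic(size=300) > 0).astype(float)

res_l1 = sm.Logit(endog, exog).fit_regularized(method="l1", alpha=1.0)
print(res_l1.params)  # some coefficients are shrunk toward or exactly to zero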

Iteratively reweighted least squares, IRLS

The hardcoded optimizer in GLM and RLM. It usually works well and converges in a few iterations. Problem: possible convergence problems for some starting values; however, we have seen this so far only in "cooked" examples.
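
A minimal GLM sketch on synthetic Poisson data; fit runs the built-in IRLS loop, and maxiter or start_params can be passed to fit if the defaults are not enough:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
exog = sm.add_constant(rng.normal(size=(300, 2)))
endog = rng.poisson(np.exp(exog @ np.array([0.2, 0.5, -0.3])))

# IRLS is the fitting algorithm used by GLM.fit
res = sm.GLM(endog, exog, family=sm.families.Poisson()).fit(maxiter=100)
print(res.params)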

lbfgs

fmin_l_bfgs_b without constraints has only recently been added. It seems to converge fast and more robustly than bfgs. My impression from some examples: it doesn't bring the gradient as close to zero as bfgs does (which might be important when we use the score for hypothesis testing).
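
A sketch comparing, on the same synthetic model, how close bfgs and lbfgs bring the maximum absolute score to zero (the observation above suggests bfgs usually gets closer):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
exog = sm.add_constant(rng.normal(size=(500, 3)))
endog = (exog @ np.array([0.3, 0.8, -0.6, 0.1]) + rng.logistic(size=500) > 0).astype(float)

model = sm.Logit(endog, exog)
for method in ("bfgs", "lbfgs"):
    res = model.fit(method=method, disp=0)
    print(method, np.abs(model.score(res.params)).max())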

Others

fmin_cg: not used much; possible platform and version problems; can require a considerable number of iterations.

powell: used in emplike as an alternative to nm. It is supposed to be robust, but I haven't seen much improvement compared to nm, and it seems to be less robust than nm (based on a very limited set of examples).

Global

basinhopping is currently the only global optimizer; it was connected only recently and requires a recent scipy. We have little experience with it so far, but it seems to be working very well.
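
A sketch of using it through fit (assuming a statsmodels/scipy combination recent enough to expose it); it repeatedly perturbs the parameters and restarts a local solver, so it is slower but can escape local optima:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
exog = sm.add_constant(rng.normal(size=(300, 2)))
endog = (exog @ np.array([0.2, 1.0, -0.7]) + rng.logistic(size=300) > 0).astype(float)

res_global = sm.Logit(endog, exog).fit(method="basinhopping", disp=0)
print(res_global.params)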

Rootfinder

used only for one-dimensional search AFAIK

fsolve: can be pretty nonrobust; maybe problems in emplike.

brentq: works well but requires predefined bounds.

expanding brentq: an initial expanding search for a bound, then brentq. Seems to work very well; a "home made" specific algorithm used in the power calculations and in the generic ppf calculation in scipy.stats.
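
A rough sketch of the "expanding brentq" idea, not the statsmodels implementation (the helper name and the expansion factor are made up here): expand the upper bound geometrically until the function changes sign, then hand the bracket to scipy's brentq.

import numpy as np
from scipy import optimize

def expanding_brentq(func, low, upp, factor=10, maxiter=10):
    # hypothetical helper, illustrating the idea only
    f_low = func(low)
    for _ in range(maxiter):
        if np.sign(func(upp)) != np.sign(f_low):
            return optimize.brentq(func, low, upp)
        upp *= factor  # no sign change yet, expand the bracket
    raise ValueError("could not bracket a root")

# toy usage: solve exp(x) - 50 = 0 starting from a too-small bracket
print(expanding_brentq(lambda x: np.exp(x) - 50, 0.0, 1.0))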

I don't have much (or any) experience with the others.

Unknown

Not enough information yet, for example for Kerby's new models.

kshedden commented 10 years ago

The linear mixed model (MixedLM) can be very challenging for optimization, especially if the random effects covariance matrix is nearly singular. I tried several of the scipy optimizers, and only bfgs consistently worked. I have not done a systematic study looking at either the quality of the solution or the timings. But in hundreds of simulated data sets and 4 real data sets, the strategy as currently implemented always succeeded.

The default strategy is to do one step of pure gradient descent (not a full line search, just step-halving until a downhill step is found), then switch to BFGS. scipy doesn't seem to have a pure gradient descent, so this is coded explicitly. There is also an EM option but by default this is turned off.

The entire steepest descent/EM/bfgs search is wrapped in a loop that iterates up to 10 times, attempting to get a converged result from bfgs. Thus, each time bfgs fails, it takes a few more steepest descent steps and tries again.
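
A generic sketch of this strategy (not the MixedLM code; the helper and the toy log-likelihood are made up): one step-halving gradient step to get a downhill point, then scipy's BFGS, retried a few times if BFGS does not report convergence.

import numpy as np
from scipy import optimize

def fit_with_retries(loglike, score, start, max_tries=10):
    # hypothetical helper, illustrating the idea only
    params = np.asarray(start, dtype=float)
    for _ in range(max_tries):
        # one steepest-ascent step with step halving (we maximize loglike)
        grad, step = score(params), 1.0
        while loglike(params + step * grad) <= loglike(params) and step > 1e-10:
            step /= 2.0
        params = params + step * grad
        # then BFGS on the negative log-likelihood
        res = optimize.minimize(lambda p: -loglike(p), params,
                                jac=lambda p: -score(p), method="BFGS")
        params = res.x
        if res.success:
            break
    return params, res

# toy usage: a concave "log-likelihood" with its maximum at (1, -2)
ll = lambda p: -((p[0] - 1) ** 2 + (p[1] + 2) ** 2)
sc = lambda p: np.array([-2.0 * (p[0] - 1), -2.0 * (p[1] + 2)])
print(fit_with_retries(ll, sc, [10.0, 10.0])[0])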

I'm interested in comparing this to R and Stata in terms of run time and solution quality. I have found one example so far where our approach gives a higher-likelihood solution than is found by R.

GEE does not use any scipy optimizers (or root finders, since it aims to solve score equations rather than optimize a loss function). The Gauss-Seidel iterations are implemented explicitly. It can fail, most often in Poisson models, and in models with complex dependence structures. It's often helpful to fit the model once with the independence covariance structure, then use the coefficients from this fit as starting values for the model of interest.
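
A sketch of that warm-start pattern on synthetic data (the data and sizes are made up): fit with the Independence structure first, then pass its coefficients as start_params to the fit with the structure of interest.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_groups, n_per = 100, 4
groups = np.repeat(np.arange(n_groups), n_per)
exog = sm.add_constant(rng.normal(size=(n_groups * n_per, 2)))
endog = rng.poisson(np.exp(exog @ np.array([0.2, 0.4, -0.3])))

fam = sm.families.Poisson()
res_ind = sm.GEE(endog, exog, groups=groups, family=fam,
                 cov_struct=sm.cov_struct.Independence()).fit()
res_exch = sm.GEE(endog, exog, groups=groups, family=fam,
                  cov_struct=sm.cov_struct.Exchangeable()).fit(
                      start_params=res_ind.params)
print(res_exch.params)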

josef-pkt commented 10 years ago

For several MLE models we use a simplified model to get the starting values. Stata similarly switches models to get the starting point for the more complicated models. So far there is little choice for users to influence the sequence of estimators. We have the option of start_params in most models, so users can choose their own, but it's difficult to know what a good sequence is.

For NegativeBinomial fit_regularized I was experimenting with log-linear OLS, Poisson, and NegativeBinomial on a constant to get a sequence of increasing difficulty for the optimization. In good cases on my computer, using more steps is a bit slower, but it might be more robust in general.
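
A simplified sketch of such a sequence on synthetic data (only the Poisson to NegativeBinomial step, with a made-up starting value for the dispersion):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
exog = sm.add_constant(rng.normal(size=(500, 3)))
mu = np.exp(exog @ np.array([0.3, 0.5, -0.4, 0.2]))
endog = rng.negative_binomial(n=2, p=2 / (2 + mu))  # overdispersed counts

res_poisson = sm.Poisson(endog, exog).fit(disp=0)
start = np.append(res_poisson.params, 0.1)  # extra slot for the dispersion alpha
res_nb = sm.NegativeBinomial(endog, exog).fit(start_params=start, disp=0)
print(res_nb.params)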

josef-pkt commented 10 years ago

Another issue is scaling variables: I haven't found much effect from rescaling variables in linear models, except in some extreme near-singular cases.

But it can have a large effect for non-linear optimization. We rescale the objective in fit for MLE models to minimize the average loss; this was introduced initially for fmin_slsqp in fit_regularized. And I ran into a problem with the GMM version of Poisson with one much larger exog variable, where Stata didn't have any problems at all (related to bfgs not being able to handle huge gradients).
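
A sketch of rescaling the explanatory variables before a non-linear fit (synthetic data with one regressor on a much larger scale):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = np.column_stack([rng.normal(size=300), 1000 * rng.normal(size=300)])
endog = (0.5 * x[:, 0] + 0.001 * x[:, 1] + rng.logistic(size=300) > 0).astype(float)

x_std = (x - x.mean(0)) / x.std(0)  # standardize the raw regressors
exog = sm.add_constant(x_std)
res = sm.Logit(endog, exog).fit(method="bfgs", disp=0)
# the slope coefficients are on the standardized scale:
# divide by x.std(0) to map them back to the original units
print(res.params)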

josef-pkt commented 10 years ago

The entire steepest descent/EM/bfgs search is wrapped in a loop that iterates up to 10 times, attempting to get a converged result from bfgs. Thus, each time bfgs fails, it takes a few more steepest descent steps and tries again.

This might be a good idea for cases where we can use different starting values or models. Try the simple optimization first, and if that fails, try harder.

If the optimization problem is relatively nice, then just doing the optimization with simple default starting values can work very well. Skipper was looking into dropping the more complicated approximate starting model for ARMA.

However, if we use just default starting values, then we leave it up to the user to find a solution for the "bad" cases. I haven't tried basinhopping yet to see whether it can find good starting values by random search in the messy cases.

josef-pkt commented 10 years ago

@kshedden @jseabold I think it would also be useful to compile a list of difficult or failing example cases. We have the corner-case tag but that is mostly for edge cases and not for cases that are just difficult to optimize. Just a list with links to examples or gists would be enough to get an overview and quickly find examples with problems.

Another FAQ issue, to make it easy to add comments, edit, and cross-link from issues?

jseabold commented 10 years ago

FWIW, one of my intentions for the Optimizer refactor that was recently merged was to make it easier to do something like solver=[('bfgs', 10), ('nm', 15), ('newton', None)] (syntax unimportant), similar to how Stata allows an explicit number of iterations for each solver. It wouldn't change any defaults (unless we want to). Maybe I can have a crack at it one night soon. Now that I'm thinking about it, I'm pretty sure it would be easy to do.
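
For now, something similar can be done by hand by chaining fit calls through start_params; a sketch on synthetic data (the particular solver sequence and iteration counts are arbitrary):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
exog = sm.add_constant(rng.normal(size=(400, 3)))
endog = (exog @ np.array([0.2, 0.9, -0.7, 0.4]) + rng.logistic(size=400) > 0).astype(float)

model = sm.Logit(endog, exog)
params = None
for method, maxiter in [("nm", 200), ("bfgs", 50), ("newton", 50)]:
    res = model.fit(start_params=params, method=method,
                    maxiter=maxiter, disp=0)
    params = res.params  # feed into the next solver
print(res.params)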

josef-pkt commented 10 years ago

That will be very useful in many cases. For internal usage I think we can work better by conditioning on the outcome of the optimization instead of relying only on a pre-defined fixed sequence.

josef-pkt commented 10 years ago

update on fmin_cg

It looks like it might be very scale sensitive. I'm getting faster and more accurate convergence in the ProbitCG test case after standardizing the exog. #1690 #1648

Example of the effect of standardizing: res3 is standardized, cls_res1 is the ProbitCG test case.

>>> res3.model.score(res3.params * 0.0)
array([ 11.86771504,   7.23442256,  10.09196687,  -7.97884561])
>>> cls_res1.model.score(cls_res1.params * 0)
array([ -19.33274291, -146.81075919,    1.59576912,   -7.97884561])

>>> (res3.model.hessian(res3.params * 0.0)**2).sum()
1711.237383391918
>>> (cls_res1.model.hessian(cls_res1.params * 0)**2).sum()
106589340.64400685

>>> np.linalg.eigvalsh(res3.model.hessian(res3.params * 0.0))
array([-27.92937136, -20.37183272, -19.31737125, -11.95889623])
>>> np.linalg.eigvalsh(cls_res1.model.hessian(cls_res1.params * 0))
array([ -1.03242082e+04,  -6.47371961e+00,  -4.94567529e+00,
        -3.34925161e-01])
josef-pkt commented 9 years ago

fmin_powell in emplike is very slow for the confidence intervals, even slower than nm (on Windows python 32-bit)

statsmodels.emplike.tests.test_regression.TestRegressionPowell.test_ci_beta0 ... ok
statsmodels.emplike.tests.test_regression.TestRegressionPowell.test_ci_beta1 ... ok
statsmodels.emplike.tests.test_regression.TestRegressionPowell.test_ci_beta2 ... ok
statsmodels.emplike.tests.test_regression.TestRegressionPowell.test_ci_beta3 ... ok

Drop powell, or does it have any advantages?