statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

mixedlm runs indefinitely without producing output #4847

Open epierson9 opened 6 years ago

epierson9 commented 6 years ago

I am fitting a mixed model with about 10,000 groups, e.g.:

import statsmodels.formula.api as smf
model = smf.mixedlm('Y ~ cov_1 + cov_2 + cov_3', data, groups=data['grouping_variable']).fit()

For most substratifications of my data this works fine and takes about 2 minutes to run, but occasionally it runs indefinitely (for hours) without producing any output. I assume there is some idiosyncrasy of the data that is causing this, but I'm not sure what it might be. I'm running version 0.8.0.

kshedden commented 6 years ago

A couple of options:

Try using method='lbfgs' or method='cg' in MixedLM.fit, or as a last resort try method='powell' (sometimes Powell seems to run successfully and claims convergence, but the gradient isn't really that small).

There is a pending PR (#4819) that will probably be merged within a few days that gives more control over the optimization options; e.g., you can use method=['bfgs', 'lbfgs', 'cg', 'bfgs'] to try several optimizers in sequence.

You can also try something like maxiter=10 in fit and check whether it is stuck at a nan or infinite point.

Beyond this we would need to see your data to better understand what the issue might be. One possible issue would be a singular model.exog; this could be ruled out by running the same model as OLS (omitting the groups argument).

I have seen it error out and I have seen it return at a nan point, but I don't think I have ever seen it get stuck for hours without giving an error or warning, so this is something new.
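
A minimal sketch of these suggestions, assuming the model call from the original report (the column names are the reporter's):

import statsmodels.formula.api as smf

model = smf.mixedlm('Y ~ cov_1 + cov_2 + cov_3', data, groups=data['grouping_variable'])

# Alternative optimizers; 'lbfgs' and 'cg' are often more robust than the default here.
result = model.fit(method='lbfgs')

# Cap the iterations and inspect the parameters for nan/inf values.
result_short = model.fit(method='lbfgs', maxiter=10)
print(result_short.params)

# Rule out a singular fixed-effects design by fitting the same mean structure with OLS.
ols_result = smf.ols('Y ~ cov_1 + cov_2 + cov_3', data=data).fit()
print(ols_result.summary())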

kshedden commented 6 years ago

From an optimization point of view, this model is only a 1 parameter model, since the fixed effects and error variance are profiled out. So even more strange that this is happening...

epierson9 commented 6 years ago

Thank you! This is very helpful. I tried a couple of your suggestions:

  1. Fitting the mixed model with fit(method='lbfgs') sometimes errors out with a numpy.linalg.linalg.LinAlgError: Singular matrix. Fitting with method='cg' doesn't seem to error out, but sometimes returns "ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals" -- maybe this is ok?
  2. Fitting with OLS using a fixed effect for the group, e.g. model = sm.OLS.from_formula('Y ~ cov_1 + cov_2 + cov_3 + group', data).fit() (the groups are strings in my data -- hopefully this is ok), yields the error numpy.linalg.linalg.LinAlgError: SVD did not converge.
  3. Fitting OLS without the group fixed effects, e.g. model = sm.OLS.from_formula('Y ~ cov_1 + cov_2 + cov_3', data).fit(), does not seem to yield errors.

Possibly it's some idiosyncrasy of the data -- Y is binary in many of my tests, which seems like it could cause problems, and some of the groups have small numbers of observations. The group variable is also encoded as strings, which I assumed was ok.
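
(A quick way to check the group-size distribution, sketched with the same hypothetical column name as above:)

print(data.groupby('grouping_variable').size().describe())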

kshedden commented 6 years ago

Do you have any NaN or Inf values in your data? I don't think missing values are dropped automatically by MixedLM, but if you use missing='drop' in your model construction they should be (for a long time MixedLM ignored the missing argument in class construction, but this was fixed about a year ago). The other option is just to call dropna on your data frame before passing it into the model.

If your covariates are on very different scales, it might be a good idea to standardize them.
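
A sketch of both suggestions (missing-value handling and standardization), using the column names from the original report:

import statsmodels.formula.api as smf

# Drop rows with missing values up front (equivalently, pass missing='drop' below).
data_clean = data.dropna()

# Put the covariates on a common scale.
for col in ['cov_1', 'cov_2', 'cov_3']:
    data_clean[col] = (data_clean[col] - data_clean[col].mean()) / data_clean[col].std()

model = smf.mixedlm('Y ~ cov_1 + cov_2 + cov_3', data_clean,
                    groups=data_clean['grouping_variable'], missing='drop')
result = model.fit()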

MixedLM is continuously being developed. What version of statsmodels are you using? The GitHub master may have some small improvements over the 0.9 release, and the 0.9 release should be much better than earlier releases (at least in terms of MixedLM).

How many distinct values does the group variable have, and what is the overall sample size?

I think it is worth pursuing what happened with your OLS fit when you included group as a factor. It is fine that the group variable contains strings, this just means that it will be converted to a factor, which is what you want. Non-convergence of the SVD is fairly rare, but can result from Inf/NaN values or other weird scalings in the data. I think there is a clue there.

epierson9 commented 6 years ago

No NaNs or Infs in the data. Covariates should all be binary (they are categorical variables). The dependent variable is also binary -- maybe that is causing problems, since it doesn't fit the OLS assumptions?

I'm using statsmodels version 0.8.0. Group variable has ~1e4 distinct values, and the data has 1e5-1e6 rows total.

epierson9 commented 6 years ago

I will try upgrading statsmodels to 0.9 and see if that improves things.

kshedden commented 6 years ago

The binary outcome would raise big concerns about the statistical validity of your results, but should not on its own create numerical convergence issues. I would still love to get to the bottom of this using MixedLM, but your research aims would be better met by a mixed GLM or GEE. We have a fairly mature GEE, and an early prototype of mixed GLM called BinomialBayesMixedGLM. You would want to get the latest master off of github if you want to use mixed GLM since we have been actively working on this code recently.
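
A hedged sketch of the GEE alternative for a binary outcome (column names again from the original report):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# GEE with an exchangeable working correlation plays a role similar to a random intercept.
model = smf.gee('Y ~ cov_1 + cov_2 + cov_3', groups='grouping_variable', data=data,
                family=sm.families.Binomial(), cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.summary())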

With 10^4 groups, I don't think the fixed effects OLS would work (it would require a lot of memory; that is perhaps what is killing the SVD). So this is probably not a productive direction to pursue further.

epierson9 commented 6 years ago

I'll let you know how 0.9 does fitting the MixedLM.

The mixed model is just a robustness check on my main results, and I wanted to use a simple mixed model, not a mixed GLM, for interpretability -- I wanted to be able to interpret the fixed effects as changes in the probability the dependent variable Y=1, not as odds ratios, analogous to a linear probability model. But maybe this isn't ok to do with mixed models? What are your concerns about statistical validity?

kshedden commented 6 years ago

I see. If you want to fit a linear probability model this might be OK, at least for the point estimation (standard errors may not perform too well, though). Regardless of statistical validity, this model still should fit, so it would be great to get past the computational issue. 0.9 is much improved over 0.8 (assuming you are using released versions), and there are continuing improvements on GitHub master. Hopefully this will be enough to get this going.

kshedden commented 6 years ago

If you continue to have trouble, please try something like the following:

model.score(result.params_object)

where model and result are your MixedLM model and results object after fitting. This should give you the gradient vector at the end of the optimization, and lets you see how close you are to converging.
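
For example (a sketch; model and result as above):

import numpy as np

grad = model.score(result.params_object)
# At an interior optimum the gradient should be close to zero.
print(np.max(np.abs(grad)))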

epierson9 commented 6 years ago

Interesting, thanks! Upgrading to 0.9 did not remove the warning ("ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals"), although it still runs without errors. When I run model.score(result.params_object) I get values that are <= 3 in absolute value. Not sure what a reasonable value is / what the scale of that number is.

epierson9 commented 6 years ago

Other than that the results seem reasonable to me. Specifically, the fixed effects estimates are pretty similar to what you get when you remove each group's mean and then run a simple linear regression on the same covariates.

kshedden commented 6 years ago

Is the variance parameter (for your 10K groups) approaching zero? If so, the MLE may be on the boundary, and we do not expect the score vector to get very small. For a non-boundary MLE, a norm of ~3 is not very small. A random intercepts model after profiling should only have 1 parameter left, so the score vector you get with the call to model.score(result.params_object) should be a single number. If this is not the case, then I am confused about the structure of your model.
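
One way to check whether the group variance is collapsing toward the boundary (a sketch against the fitted result):

# Estimated random-effects covariance; 1x1 for a random-intercepts model.
print(result.cov_re)
# Residual (error) variance.
print(result.scale)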


epierson9 commented 6 years ago

Yes, the variance parameter is generally quite small (< .1) because the dependent variable is binary. I just ran a few more tests, and it seems like some of the models are still running indefinitely without producing output. I may try fitting the model using lme4 as a comparison, and if that doesn't work, give up. (Out of curiosity, why does it not seem possible that model misspecification -- the dependent variable being binary -- is causing fitting problems? I am not familiar with the fitting procedure.)


kshedden commented 6 years ago

The MLE is fit using an optimization process, and the objective function is not necessarily any harder to optimize just because the model is misspecified. But the optimizers being used are only guaranteed to work well if the limit point (local maximum) is in the interior of the parameter space, not on the boundary. My guess is that the situations where the optimizer is struggling have their MLE on the boundary.

0.1 is not that small however. Also, even though there are no guarantees that things will work well when the MLE (of a variance parameter) is zero, usually the optimizer will converge to the boundary without a problem.

So there is still something unusual about your data.

kshedden commented 6 years ago

Below is a simulation that tries to mimic what you have. Using the latest master, it converges when method='bfgs' (default). The MLE is very close to the boundary (groups variance is around 10^-12). But the gradient is still small and the optimization succeeds under the usual criteria. So there must be something else going on with your data that is causing the convergence issues.

import numpy as np
import pandas as pd
import statsmodels.api as sm

n_grp = 10000
grp_size = 5
n_tot = n_grp * grp_size

# Group labels: n_grp groups of grp_size observations each.
groups = np.kron(np.arange(n_grp), np.ones(grp_size))

df = pd.DataFrame({"groups": groups})

df["x1"] = np.random.normal(size=len(groups))
df["x2"] = np.random.normal(size=len(groups))
df["x3"] = np.random.normal(size=len(groups))

df["y"] = df.x1 + df.x2 + df.x3 + np.random.normal(size=n_tot)

# Dichotomize the response to mimic a binary outcome.
df["y"] = (df.y > 1).astype(int)

model = sm.MixedLM.from_formula("y ~ x1 + x2 + x3", groups="groups", data=df)
result = model.fit(method="lbfgs")

# Gradient at the solution; should be near zero if the optimization converged.
print(model.score(result.params_object))

epierson9 commented 6 years ago

Hm, interesting! Thanks for the simulation. If it helps, when I try fitting the model in R's lme4, it mostly fits fine, but occasionally I get a warning:

In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
  Model is nearly unidentifiable: very large eigenvalue

Maybe this gives some clue as to what might be going on?


kshedden commented 6 years ago

Can you share the summary table from one of the models where it fits in R but not in statsmodels? If you don't want to post it here you can email it to me at kshedden@umich.edu.

epierson9 commented 6 years ago

Sorry, can’t do that — confidential data.


epierson9 commented 5 years ago

One other thing I noticed about this -- while the mixed model does give the "ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals" warning when I fit it on my data, it always reports that it converged in the end (i.e., in the model.summary() output it always says "Converged: Yes"). Not sure what to make of that.

kshedden commented 5 years ago

Since these models can be hard to fit (due to non-convexity of the log-likelihood and/or weakly identified parameters), the algorithm tries several times to optimize using different methods and/or repeatedly applying the same algorithm with more iterations. The warnings you are seeing are probably coming from some of the failed attempts to optimize, but since you get converged=True in the end, the algorithm must have converged on its last attempt.

We refactored some of this logic recently, so you have more control over how this process unfolds (this is in the master, not yet in a release). If all attempts fail, then you will get a final warning that says "Gradient optimization failed, |grad|=...", and in this case converged will be False.
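
A sketch of checking the final status, using the method-list interface mentioned above:

# Try several optimizers in sequence (master-branch interface from #4819).
result = model.fit(method=['lbfgs', 'cg', 'powell'])
# True only if the final attempt satisfied the convergence criteria.
print(result.converged)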

The relevant code in the master is here:

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/mixed_linear_model.py#L2045

epierson9 commented 5 years ago

Ah, ok, thanks! But as long as it converges in the end, I shouldn’t worry too much about the intermediate warnings?


kshedden commented 5 years ago

That's right, those warnings can be ignored.

epierson9 commented 5 years ago

Thank you!! Very helpful.
