lme4 / lme4

Mixed-effects models in R using S4 classes and methods with RcppEigen
Other
624 stars 148 forks source link

Overdispersion as part of lme4 summary #220

Open rvaradhan opened 10 years ago

rvaradhan commented 10 years ago

Hi, It would be nice if the summary(lme4.obj) included an estimate of overdispersion. On the glmm.wikidot.com/faq there is a function called `overdisp_fun()' that does this, but it would be more useful to the user if this estimation is automatically done when summary(lme4.obj) is invoked. Thanks for considering my request.

Best, Ravi

bbolker commented 10 years ago

The summary output for glm objects includes a line of the form

Residual deviance:  xxxxx  on xx degrees of freedom

Would that be sufficient?

But perhaps we don't need to add that, because the equivalent information is already reported at the top of the summary:

    AIC      BIC   logLik deviance df.resid 

One could argue that computing the ratio would be good for the user. glm() doesn't do this automatically, though -- I guess the argument is that this is something extra the user should do for themselves (lest we end up on the slippery slope to SAS-style "print all possible tests" output)

rvaradhan commented 10 years ago

Hi Ben, It seems that the information reported by overdisp_fun() function given on the lme4 wiki page (glmm.wikidot.com/faq) is not simply equivalent to the summary information (e.g., logLik, deviance, etc.). It is providing different information that the user may benefit from, and is not trivially obtainable from the current summary of glmer().

Let me reproduce the code here:

overdisp_fun <- function(model) {

number of variance parameters in

an n-by-n variance-covariance matrix

vpars <- function(m) { nrow(m)*(nrow(m)+1)/2 } model.df <- sum(sapply(VarCorr(model),vpars))+length(fixef(model)) rdf <- nrow(model.frame(model))-model.df rp <- residuals(model,type="pearson") Pearson.chisq <- sum(rp^2) prat <- Pearson.chisq/rdf pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE) c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval) }

Let me know if I am mistaken.

Thanks, Ravi

[edited out e-mail response since the history is stored on Github]

bbolker commented 10 years ago

Let me be a little bit more precise here. Information that might be useful to the user includes:

The two measures of overdispersion are asymptotically equivalent but as pointed out here the Pearson sum of squares is less biased (don't have a good ref for that, my copy of McCullagh and Nelder is loaned out at the moment). (There's other good stuff in that thread.)

So the residual deviance and df are probably adequate for getting a rough idea about the goodness-of-fit / degree of overdispersion, but if you were doing more formal testing you'd want (1) to calculate the ratio formally rather than eyeballing it (2) possibly to compute a p-value (keeping in mind that this too is approximate -- already for GLMs, and even more so for GLMMs since even the concept of "residual df" gets murky ...)

The tricky part here is deciding which information should actually go in the summary. R is traditionally minimalist in this respect (presumably to avoid creeping SAS-ism), and if we follow glm strictly we wouldn't add anything to the summary.

(For what it's worth here's a link to overdispersion tests described by Cameron and Trivedi

Although it does look like earlier versions of lmer printed an 'estimated scale parameter' in the summary output ...

bbolker commented 10 years ago

PS about the first half of the overdispersion function given above can now be replaced simply by rdf <- df.residual(model), which does the same (approximate!) calculation ...

rvaradhan commented 10 years ago

Thanks, Ben. I have an example where the two measures of overdispersion are quite different. Hence my inquiry.

summary(mod2) Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod'] Family: binomial ( logit ) Formula: imps79b ~ tx + I(sweek) + (sweek | id) Data: schiz

 AIC      BIC   logLik deviance df.resid

1259.4 1291.7 -623.7 1247.4 1597

Scaled residuals: Min 1Q Median 3Q Max -3.03973 0.00912 0.04957 0.21372 1.45395

Random effects: Groups Name Variance Std.Dev. Corr id (Intercept) 13.993 3.741 sweek 4.093 2.023 -0.65 Number of obs: 1603, groups: id, 437

Fixed effects: Estimate Std. Error z value Pr(>|z|) (Intercept) 9.1994 1.3105 7.020 2.22e-12 tx -2.5075 0.5996 -4.182 2.89e-05

I(sweek) -3.1344 0.4503 -6.960 3.40e-12 ***

Signif. codes: 0 ‘_**’ 0.001 ‘_’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects: (Intr) tx tx -0.786 I(sweek) -0.922 0.537

Overdisperion based on Pearson residuals

overdisp_fun(mod2) chisq ratio rdf p 297.83 0.19 1597.00 1.00

From: Ben Bolker [mailto:notifications@github.com] Sent: Monday, July 21, 2014 2:04 PM To: lme4/lme4 Cc: Ravi Varadhan Subject: Re: [lme4] Overdispersion as part of lme4 summary (#220)

Let me be a little bit more precise here. Information that might be useful to the user includes:

The two measures of overdispersion are asymptotically equivalent but as pointed out herehttps://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015537.html the Pearson sum of squares is less biased (don't have a good ref for that, my copy of McCullagh and Nelder is loaned out at the moment). (There's other good stuff in that thread.)

So the residual deviance and df are probably adequate for getting a rough idea about the goodness-of-fit / degree of overdispersion, but if you were doing more formal testing you'd want (1) to calculate the ratio formally rather than eyeballing it (2) possibly to compute a p-value (keeping in mind that this too is approximate -- already for GLMs, and even more so for GLMMs since even the concept of "residual df" gets murky ...)

The tricky part here is deciding which information should actually go in the summary. R is traditionally minimalist in this respect (presumably to avoid creeping SAS-ism), and if we follow glm strictly we wouldn't add anything to the summary.

(For what it's worth here's a link to overdispersion tests described by Cameron and Trivedihttp://books.google.ca/books?id=d6BTfLBbxjcC&pg=PA89&lpg=PA89&dq=cameron+trivedi+overdispersion&source=bl&ots=Vgol1rnmCc&sig=94U9Ht_uMiM77nL0MXWBF1qhctA&hl=en&sa=X&ei=p1XNU6qqFLCj8gGS_oHwDA&ved=0CDkQ6AEwAg#v=onepage&q=cameron%20trivedi%20overdispersion&f=false

Although it does look like https://stat.ethz.ch/pipermail/r-help/2006-September/112779.html<earlier%20versions%20of%20lmer%20printed%20an%20'estimated%20scale%20parameter'%20in%20the%20summary%20output> ...

— Reply to this email directly or view it on GitHubhttps://github.com/lme4/lme4/issues/220#issuecomment-49642176.

bbolker commented 10 years ago

There's an issue here that I fixed in a very recent commit ... (i.e. the reported deviance is not really a deviance, but -2L, which differs by a constant (important in this context!)) Try comparing sum(residuals(.,type="deviance")^2) with sum(residuals(.,type="pearson")^2) instead ...

bbolker commented 10 years ago

In any case, if these are binary responses, then you probably shouldn't be worrying about overdispersion in the first place! (I think this is treated at http://glmm.wikidot.com/faq , although I'm not sure ...)

rvaradhan commented 10 years ago

I am not sure that I follow. Are you saying that there is a problem with the overdisp_fun() in how it approximates overdisperion?

I appreciate the remark about the inappropriateness of overdispersion for Bernoulli response. I just took a glance at McCullagh and Nelder sections 4.4.3 and 4.4.5.

Thanks, Ravi

From: Ben Bolker [mailto:notifications@github.com] Sent: Monday, July 21, 2014 2:12 PM To: lme4/lme4 Cc: Ravi Varadhan Subject: Re: [lme4] Overdispersion as part of lme4 summary (#220)

There's an issue here that I fixed in a very recent commit ... (i.e. the reported deviance is not really a deviance, but -2L, which differs by a constant (important in this context!)) Try comparing sum(residuals(.,type="deviance")^2) with sum(residuals(.,type="pearson")^2) instead ...

— Reply to this email directly or view it on GitHubhttps://github.com/lme4/lme4/issues/220#issuecomment-49643180.

rvaradhan commented 10 years ago

While i.i.d. binary responses cannot be overdispersed, I have repeated binary responses over time clustered within subjects. Now, it is fair game to try to estimate overdisperion, right? Best, Ravi

From: Ravi Varadhan Sent: Monday, July 21, 2014 2:24 PM To: 'lme4/lme4'; lme4/lme4 Subject: RE: [lme4] Overdispersion as part of lme4 summary (#220)

I am not sure that I follow. Are you saying that there is a problem with the overdisp_fun() in how it approximates overdisperion?

I appreciate the remark about the inappropriateness of overdispersion for Bernoulli response. I just took a glance at McCullagh and Nelder sections 4.4.3 and 4.4.5.

Thanks, Ravi

From: Ben Bolker [mailto:notifications@github.com] Sent: Monday, July 21, 2014 2:12 PM To: lme4/lme4 Cc: Ravi Varadhan Subject: Re: [lme4] Overdispersion as part of lme4 summary (#220)

There's an issue here that I fixed in a very recent commit ... (i.e. the reported deviance is not really a deviance, but -2L, which differs by a constant (important in this context!)) Try comparing sum(residuals(.,type="deviance")^2) with sum(residuals(.,type="pearson")^2) instead ...

— Reply to this email directly or view it on GitHubhttps://github.com/lme4/lme4/issues/220#issuecomment-49643180.

bbolker commented 10 years ago

Answers:

ID age resp
01 37  0
01 37  0
01 37  1

to

ID age success N
01 37   1   3
rvaradhan commented 10 years ago

Ben, I have longitudinal data, where the responses are evaluated at different times, and the time variable plays a critical role. So, the data cannot be collapsed.

I would also like to note that there seems to be a bug in the overdisp.glmer() function in "RVAideMemoire" package. It calculates “deviance” residuals rather than “Pearson” residuals. Consequently, the results of overdisp_fun() and overdisp.glmer() are different.

overdisp.glmer <- function (model) { vpars <- function(m) { nrow(m) * (nrow(m) + 1)/2 } model.df <- sum(sapply(lme4::VarCorr(model), vpars)) + length(lme4::fixef(model)) rdf <- nrow(model.frame(model)) - model.df rp <- residuals(model) # this is a mistake dev <- sum(rp^2) prat <- dev/rdf cat(paste("Residual deviance: ", round(dev, 3), " on ", rdf, " degrees of freedom", " (ratio: ", round(prat, 3), ")\n", sep = "")) }

Best, Ravi

From: Ben Bolker [mailto:notifications@github.com] Sent: Monday, July 21, 2014 10:01 PM To: lme4/lme4 Cc: Ravi Varadhan Subject: Re: [lme4] Overdispersion as part of lme4 summary (#220)

Answers:

ID age resp

01 37 0

01 37 0

01 37 1

to

ID age success N

01 37 1 3

— Reply to this email directly or view it on GitHubhttps://github.com/lme4/lme4/issues/220#issuecomment-49689684.

bbolker commented 10 years ago

If you have binary data that can't be collapsed, then you can't identify overdispersion.

I would say it's more a matter of judgement (see previously referenced link to r-sig-mixed-models thread) than a bug to use the sum of squared deviance residuals instead of Pearson residuals to assess overdispersion (aods3::gof() uses both).

rvaradhan commented 10 years ago

It is known that the sum of squared Pearson residuals from a GLM are asymptotically chi-squared (under appropriate conditions), but I don’t know if there is a corresponding result for deviance residuals.

Best, Ravi

From: Ben Bolker [mailto:notifications@github.com] Sent: Tuesday, July 22, 2014 10:13 AM To: lme4/lme4 Cc: Ravi Varadhan Subject: Re: [lme4] Overdispersion as part of lme4 summary (#220)

If you have binary data that can't be collapsed, then you can't identify overdispersion.

I would say it's more a matter of judgement (see previously referenced link to r-sig-mixed-models thread) than a bug to use the sum of squared deviance residuals instead of Pearson residuals to assess overdispersion (aods3::gof() uses both).

— Reply to this email directly or view it on GitHubhttps://github.com/lme4/lme4/issues/220#issuecomment-49744105.

rvaradhan commented 10 years ago

I was just looking at McCullagh and Nelder. Since the sum of squared deviance residuals is total deviance, D, and D is asymptotically chi-squared under certain regularity conditions, it seems that deviance residuals can also be used to check for overdispersion.

In my example, these two statistics for overdispersion (Pearson-based and deviance-based) are so different that I was concerned (although the final inference was the same).

overdisp_fun(mod1) chisq ratio rdf p 1074.94 0.67 1599.00 1.00

overdisp.glmer(mod1) Residual deviance: 719.914 on 1599 degrees of freedom (ratio: 0.45); p-value: 1

Thanks & Best, Ravi

From: Ravi Varadhan Sent: Tuesday, July 22, 2014 10:17 AM To: 'lme4/lme4'; lme4/lme4 Subject: RE: [lme4] Overdispersion as part of lme4 summary (#220)

It is known that the sum of squared Pearson residuals from a GLM are asymptotically chi-squared (under appropriate conditions), but I don’t know if there is a corresponding result for deviance residuals.

Best, Ravi

From: Ben Bolker [mailto:notifications@github.com] Sent: Tuesday, July 22, 2014 10:13 AM To: lme4/lme4 Cc: Ravi Varadhan Subject: Re: [lme4] Overdispersion as part of lme4 summary (#220)

If you have binary data that can't be collapsed, then you can't identify overdispersion.

I would say it's more a matter of judgement (see previously referenced link to r-sig-mixed-models thread) than a bug to use the sum of squared deviance residuals instead of Pearson residuals to assess overdispersion (aods3::gof() uses both).

— Reply to this email directly or view it on GitHubhttps://github.com/lme4/lme4/issues/220#issuecomment-49744105.