rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
358 stars 31 forks source link

Marginal Means in Truncated Models #331

Closed awaldman123 closed 2 years ago

awaldman123 commented 2 years ago

I posted this over on the glmmTMB listserv but thought I could provide an example dataset here that may be helpful:

Example.csv

I am running a hurdle negative binomial model to look at the differences in counts of differing types in different locations. My major interest is the conditional model (ie when counts are above 0).

I run the following code:

data<-read.csv("Example.csv",header=T)
as.factor(data$ID)->data$ID
as.factor(data$Location)->data$Location
as.factor(data$Type)->data$Type
model<-glmmTMB(Count ~ Location*Type + (1 | ID), zi=~Location*Type + (1|ID), data=data, family="truncated_nbinom1",control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))

var.corr <-VarCorr(model)

Conditional model:
Groups Name        Std.Dev.
ID     (Intercept) 0.37105

Zero-inflation model:
Groups Name        Std.Dev.
ID     (Intercept) 1.3207 

emmeans <- emmeans(model, ~ Location*Type, type="response", sigma=0.37105, bias.adjust=TRUE)

Location Type response    SE  df lower.CL upper.CL
0     0             1.117 0.277 631    0.687     1.82
1     0             0.940 0.251 631    0.556     1.59
2     0             0.893 0.266 631    0.498     1.60
0     1             1.325 0.254 631    0.909     1.93
1     1             1.090 0.248 631    0.698     1.70
2     1             1.452 0.300 631    0.967     2.18

Confidence level used: 0.95
Intervals are back-transformed from the log scale
Bias adjustment applied based on sigma = 0.37105

However, I’m not sure why the estimated means and confidence intervals will include values below 1 in the conditional model as I anticipated these values would represent the average number of non-zero counts? Is there something I may be doing wrong or not understanding?

Based on Ben Bolker's response I tabulated the mean for location0/type1 including 0s and without 0s. Including 0s=1.05 and without 0s=1.88. The emmeans output seems to be in the middle of those two.

rvlenth commented 2 years ago

Support for emmeans for glmmTMB models is in the glmmTMB package. However, I believe those estimates are based on the linear predictor for the conditional model only. That is, these are estimates of the mean of the fitted negative binomial distribution before it is truncated. Perhaps Ben can confirm this.

I think estimated means for hurdle and zero-inflated models are very difficult to interpret without taking into account both parts of the model; and, unfortunately, it is a somewhat complicated matter to do so that is not currently implemented. Seems worthwhile to try, though. One would have to include all the factors that affect both models; then from the truncated part, estimate its mean M and its probability of zero PT0. And from the ZI part, estimate its probability of zero PZ0. Then the estimated mean of the response is M * (1 - PZ0) / (1 - PT0), unless I made some kind of stupid mistake. The tricky part is estimating the SE of this result.

awaldman123 commented 2 years ago

Thanks! As a workaround, if I am just interested in the non-zero counts could I just remove the zeros and fit another model? I guess I was just unsure what model would be appropriate for nested count data with 0s removed?

rvlenth commented 2 years ago

No, I don't think that's a good idea. You truly will not know what you are estimating.

rvlenth commented 2 years ago

To be more precise, by removing the zeros from the data and re-fitting the model, the software will still be trying to fit the hurdle model, which includes the NB distribution. You may be taking away the zero part, but the NB part still has nonzero P(0), and with the way emmeans is set up, you will still be estimating the mean of the un-truncated distribution. So what you will get will be biased estimates based on a model that doesn't fit very well.

awaldman123 commented 2 years ago

I was thinking more along the lines of removing 0s and running a non-truncated model to estimate just the non-zero averages and perform pairwise comparisons?

rvlenth commented 2 years ago

What distributional model for your non-truncated version. If it is a distribution that has 0 in its support, you don't have any 0s in the model so it will not fit well. It seems you may be better off removing the 0s and subtracting 1 from the response, then add 1 back later to the estimates.

But I think it will be hard to defend this kind of approach either way. To me, it seems more defensible to present the means you already have shown, and compare them. You have already fitted your hurdle model to the data. If that model fits and makes sense, then those means are the means of the distribution whose truncated portion fits the data, save for the zeros. But having said that, I'll admit I really haven't worked with these kinds of models in my consulting experience.

rvlenth commented 2 years ago

I think this issue is completed in some sense. I would hope for improvements to emmeans support in the glmmTMB package that could facilitate some semblance of combining information from the respones and zi portions of a model. Anyway, I am closing this issue; that does not prevent further comments or even re-opening if that is appropriate.