Closed mattansb closed 1 week ago
@mattansb I think I disagree on both counts.
First, it is well-known that the mean of a transformation is not always the transformation of the mean. This doesn't mean that either quantity is "biased". It just means that they are different quantities. This was discussed previously, and the choice was to report the mean of the response-scale predictions by default. The other quantity is still available by specifying the type
argument. See discussion here: https://github.com/vincentarelbundock/marginaleffects/issues/1123
Second, lme4::predict.merMod()
documents this for the re.form
argument:
If ‘NULL’, include all random effects; if ‘NA’ or ‘~0’, include no random effects.
I feel it would be misleading to include predictions for quantities with random effects, accompanied by standard errors that ignore those random effects.
Yes, I agree that the correctness depends on the "type" (scale) of the estimand of interest. I also am not sure what is worse: a biased estimate or a biased SE.
I do consider it a biased estimate - it is given on the response, but is not a true expectation / AME, but a "generalized" expectation / AME, that is "a back-transformed expectation / AME". Which is a valid quantity (it even has a name!) I guess, but I can't think of a case where that actually would be the estimand of interest in the context of a GLMM: The predicted response value for a unit that has the mean random intercept/slope on the link scale? 🤷♂️
I'm not sure the discussion in #1123 fully captures the fact that setting re.form=NA
biases the response-scale estimate by ignoring the random variability (that is kept with re.form=NULL
). I might be wrong? I'm specifically talking about what Russ Lenth is discussing under CBPP example here wrt to GLMMs.
As you can tell, I have more questions than answers (and from #1123 it seems like I'm not alone).
I think (?) I'm just tapping into the same sentiment both you and Noah have that "thinking of an average of predictions, rather than as the inverse link of the average-linear, is more intuitive." (https://github.com/vincentarelbundock/marginaleffects/issues/1123#issuecomment-2240710502)
To be clear, we first compute individual-level predictions on the response scale, and then take the mean. So the defaults conform to your sentiment.
I'm not sure we're talking about exactly the same thing. Even without avg_*()
(or any other averaging requested explicitly), we still have averaging happening on the link scale with re.form=NA
- it's just happening implicitly in predict()
, not in {marginaleffects}
. This is different to your discussion with Noah that was concerned with possible biases caused by non-linear link scales - here were are biased due to partialling out random effect, while on the link scale.
I'll try to explain - hopefully I'll get my point across or you'll be able to reiterate that we are talking about the same thing.
First, the average random effect is 0. That is:
$$\text{MEAN}(UZ_j) = 0$$
Second, when generating a single prediction for an observation with re.form=NA
we are ignoring all random effects (or alternatively, we're conditioning on a group that have the average link-scale random effects). That is:
$$ y{ij} = link^{-1}(BX{ij}) = link^{-1}(BX{ij} + 0) = link^{-1}(BX{ij} + \text{MEAN}(UZ_j)) $$
Notice that this averaging is happening on the link scale.
Therefore, the average of such predictions is biased in a GLMM due to ignoring the random effects. Really, averaging the observation level predictions has a double averaging happening with re.form=NA
:
$$ \text{MEAN}(link^{-1}(BX_{ij} + \text{MEAN}(UZj))) \neq \text{MEAN}(link^{-1}(BX{ij} + UZ_j)) $$
The latter is what you would het with re.form=NULL
, which is the unbiased expectation.
@mattansb thanks for the clarifications.
In this PR, I drop the warning when the user specifies re.form=NULL
explicitly: https://github.com/vincentarelbundock/marginaleffects/pull/1266
I have not changed the warning yet. If you want to propose an alternative wording, I'm happy to consider. I'd like that wording to be extremely minimalist, because (a) I believe that the right place for statistical advice is a vignette or textbook, not a software warning, and (b) I can't invest enough time to read enough to be sure that the warning I craft will please everyone.
Thanks Vincent - the current warning seems fine to me. If you have a vignette that addresses the topic of predictions and the mean / mean predictions in GLMMs, maybe you can add a link to that. But otherwise this seems satisfactory.
Thanks!
cool cool, thanks.
And I realize now that I had completely misread your first post. Sorry about that. The feature request was actually very simple :)
Although the wording in the warning for mixed models is true - that the uncertenyl in the SEs does not reflect the uncertenyl in the estimated random effects (intercepts / slopes), setting
re.form=NA
for GLMMs produces biased estimates (often highly biased) due to the (annoying) fact that:$$ link^{-1}(mean(\eta)) \neq mean(link^{-1}(\eta)) $$
wrt the random effects (intercepts and slopes). That is, the back-transformed mean estimate isn't the mean back-transformed estimate.
re.form = NULL
is set explicitly. I think settingre.form
explicitly should suppress the warning.re.form = NULL/NA
is "perfect" in these model? Currently the wording is:Here is an example from my mixed models course:
The sarning suggests setting
re.form = NA
, but this produced biased estimates:Same for the difference:
(This doesn’t happen with a fixed effects model:)
Created on 2024-10-14 with reprex v2.1.1