[I have revised this comment since originally writing it.]
First, I'll try to answer the question in the headline...
Would it be possible to allow the "df" parameter to accept a vector of values rather than just a single scalar?
My answer is that it is possible, but not advisable. Believe me, I have thought about this before. The df argument is passed to ref_grid(), so if you were allowed to give it a vector of numbers, that vector would have to make sense at the time the reference grid is constructed, which could very likely differ from what the user has in mind when calling, say, emmeans(). And even if the user understood that, they would have to correctly anticipate the ordering of the rows of the reference grid; so allowing this would be a pretty fragile user interface.
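To make the current behavior concrete, here is a small sketch (assuming some fitted model mod with a factor Treatment; both names are placeholders). Only a single scalar is accepted, and it takes effect when the grid is built:
rg <- ref_grid(mod, df = 20)   # one d.f., fixed for the whole grid
emmeans(rg, ~ Treatment)       # every estimate is reported with df = 20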
Second, you are incorrect in characterizing emmeans
as making model assumptions. It is the emm_basis
method for the model class that determines how d.f. calculations are made, via the @dffun
and @dfargs
slots. This question is particular to a model object that is supported by the robustlmm package -- not emmeans. It seems appropriate to ask the robustlmm developer to consider improving the d.f. support that is provided.
Thank you very much for the explanation!
I thought that the vector of df could actually be a data frame constructed in a way that reflects the combinations of factors in the grid, so the user provides a data.frame (for instance) like this:
X   Y    df
X1  Y1   10
X1  Y2   15.5
X2  Y1   16
X2  Y2   30
...
Whenever something is missing, "Inf" could be imputed with a warning, or an error thrown: "missing df for the following elements of the grid". Whether it's valid or not would be up to the user. I think that people advanced enough to specify their own dfs will be aware of this and careful with it, checking that the output matches their specification.
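A hypothetical sketch of how such a lookup might behave (this is only the proposal, not anything emmeans supports; df_table and the grid here are made up for illustration):
## hypothetical sketch of the proposed matching -- not an emmeans feature
df_table <- data.frame(X  = c("X1", "X1", "X2", "X2"),
                       Y  = c("Y1", "Y2", "Y1", "Y2"),
                       df = c(10, 15.5, 16, 30))
grid <- expand.grid(X = c("X1", "X2"), Y = c("Y1", "Y2", "Y3"))
grid <- merge(grid, df_table, by = c("X", "Y"), all.x = TRUE)
if (anyNA(grid$df)) {
    warning("missing df for the following elements of the grid; imputing Inf")
    grid$df[is.na(grid$df)] <- Inf
}
grid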
I don't think the authors of robustlmm will do that, because it's a non-standard approach. I find it sensible, and my more advanced colleagues at work (whom I consulted on this issue) would follow it as well, but it is not something that is "formally supported", even if it sounds reasonable in certain cases.
OK, I will try either to "intercept" the code in robustlmm that returns the df (Inf) and "inject" mine, or to take the test statistic and do the pt() calculation with the specified df on my own, overwriting the output of emmeans.
Thank you very much for all your efforts and let me wish you all the best in 2024!
Thanks for the new year wishes -- and the same back to you!
Regarding d.f., I don't think it's anything like you describe. What's needed for the @dffun slot is a function that will determine the d.f. for estimating a linear function $\hat\theta = \sum c_j \hat\beta_j$, where $\hat\theta$ is one of the EMMs or a contrast thereof, the $c_j$ are given constants, and the $\hat\beta_j$ are the regression coefficients for the model. The Satterthwaite d.f., for example, are based on the relationship between $Var(\hat\theta)$ and $E(\hat\theta^2)$, and the "containment" method used for lme models is the d.f. associated with the coarsest grouping associated with nonzero $c_j$ values. In neither case are we doing bookkeeping on the observations in the dataset; rather, we track the role of groupings in the $\hat\beta_j$ and the weights given them in estimating $\theta$.
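To make the shape of that machinery concrete, a minimal sketch (not emmeans internals; the values are illustrative): the function receives the vector of the $c_j$ plus whatever was stored in the @dfargs list when the grid was built, and returns a single d.f. value for that linear function:
## sketch only: a dffun maps (k, dfargs) to one d.f. value, where k
## holds the c_j for the particular linear function being estimated
dffun  <- function(k, dfargs) dfargs$df   # trivial rule: a stored constant
dfargs <- list(df = 45)                   # illustrative value
dffun(c(1, 0, 0), dfargs)                 # -> 45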
Dear @rvlenth Thank you for the response!
I could obtain Satterthwaite "for free" from lme4 and provide it here. Yes, I do know that both estimates (coefficients, variances) will be different, as the estimation was different and weighting was employed, so anything based on that will be off (a little or a lot). In my current case I think they won't be that far from the original ones (maybe in edge cases). Above N = 30 the difference between z and t isn't that big, so the df itself shouldn't be critical unless it is below 20, and I rarely deal below that. It could be (as in my case) something between 40-50, and that would be acceptable.
I remember from one vignette (about effect size) that you took a similar approach, mentioning that it's not a critical quantity and providing a range.
Actually I could even live with Inf, as the differences won't be too big, but my reviewers won't accept it. They will say: "OK, they are close, but don't naively make it bigger; make it closer to a reasonable number". And yes, I could just use the average of the dfs (still better than Inf), but then they will complain that it's the same value for every estimate... even if it doesn't actually matter too much.
And there is little or nothing I can do about that.
PS: I found that sometimes emmeans assigns infinity itself, for example in emmeans:::emm_basis.merMod:
mode = match.arg(tolower(mode),
                 c("satterthwaite", "kenward-roger", "asymptotic"))
if (!is.null(options$df) && (mode != "kenward-roger"))
    mode = "asymptotic"
..... and later ....
if (mode == "asymptotic") {
    dffun = function(k, dfargs) Inf
}
Which makes sense, as robustlmm doesn't itself provide the df and p-values:
> summary(m_mmrm)
Robust linear mixed model fit by DAStau
Formula: frm
Data: data
Control: lmerControl(check.nobs.vs.nRE = "warning")
Scaled residuals:
Min 1Q Median 3Q Max
-1.82765 -0.48266 -0.00967 0.36703 3.00827
Random effects:
Groups Name Variance Std.Dev.
PatientId (Intercept) 0.003911 0.06253
Residual 0.001147 0.03386
Number of obs: 91, groups: PatientId, 44
Fixed effects:
Estimate Std. Error t value
(Intercept) -0.0004035 0.0115931 -0.035
TimepointWeek 16 0.0128727 0.0092463 1.392
TimepointWeek 56 0.0116329 0.0104089 1.118
And that's why I thought you set it yourself in some scenarios, as a last-resort option.
By the way, let's check this very case. First lmerModLmerTest (with Satterthwaite selected):
> coef(summary(m_mmrm1))
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -7.504107e-05 0.026180493 45.90616 -0.002866297 0.99772545
TimepointWeek 16 1.640811e-02 0.009221525 45.53196 1.779327325 0.08186167
TimepointWeek 56 1.052267e-02 0.010355144 45.41416 1.016177970 0.31492800
Now robustlmm with infinite df. There are differences, as there are a few outliers that get weighted down:
> as.data.frame(coef(summary(m_mmrm))) %>% mutate(df = Inf, p = 2*pt(abs(`t value`), df, lower=FALSE))
Estimate Std. Error t value df p
(Intercept) -0.0004034906 0.011593143 -0.03480424 Inf 0.9722358
TimepointWeek 16 0.0128727105 0.009246332 1.39219640 Inf 0.1638629
TimepointWeek 56 0.0116329400 0.010408877 1.11759804 Inf 0.2637387
Now with df = 10:
> as.data.frame(coef(summary(m_mmrm))) %>% mutate(df = 10, p = 2*pt(abs(`t value`), df, lower=FALSE))
Estimate Std. Error t value df p
(Intercept) -0.0004034906 0.011593143 -0.03480424 10 0.9729208
TimepointWeek 16 0.0128727105 0.009246332 1.39219640 10 0.1940460
TimepointWeek 56 0.0116329400 0.010408877 1.11759804 10 0.2898692
or even with the calculated Satterthwaite df:
> as.data.frame(coef(summary(m_mmrm))) %>% mutate(df = coef(summary(m_mmrm1))[,"df"], p = 2*pt(abs(`t value`), df, lower=FALSE))
Estimate Std. Error t value df p
(Intercept) -0.0004034906 0.011593143 -0.03480424 45.90616 0.9723867
TimepointWeek 16 0.0128727105 0.009246332 1.39219640 45.53196 0.1706257
TimepointWeek 56 0.0116329400 0.010408877 1.11759804 45.41416 0.2696163
How can I pass it through dffun? I tried:
m_mmrm_em@dffun <- function() coef(summary(m_mmrm1))[,"df"]
but it says:
> m_mmrm_em
Error in object@dffun(x, object@dfargs) :
unused arguments (x, object@dfargs)
I'm sorry, I've never played with the object system in R :(
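Judging from that error message, the slot is apparently called as object@dffun(x, object@dfargs), so I guess the function has to accept two arguments even if it ignores them. Perhaps something like this sketch (untested; it collapses the per-coefficient dfs to their mean, since a single value must be returned per call):
## untested sketch: accept (k, dfargs) as required, ignore both, and
## return one summary value of the lmerTest Satterthwaite dfs
m_mmrm_em@dffun <- function(k, dfargs) mean(coef(summary(m_mmrm1))[, "df"])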
d.f. do matter when they are small, and that can happen easily in mixed models, because it's the numbers of groups and subgroups and such that come into play.
I also point out that reviewers are not always right, and sometimes it's important to talk back to them, with objective support of course. I would suggest that it's probably appropriate not to exceed the d.f. for a parallel non-robust model, because robustifying things generally leads to a loss of efficiency. Most robust methods yield a set of case weights as part of their calculations, and using the d.f. that result from refitting the model with those weights via a non-robust method like lmer() seems fairly defensible.
However, I'm not going to wade into this any further. There are a lot of technical details and I want to prioritize other matters. If you want to know more about how the slots are used, you can look at existing code like emmeans:::emm_basis.merMod, at the documentation for extending emmeans, and at the examples for emmobj.
PS: I think I found a workaround :)
I just fit robustlmm, took the weights and... supplied them back to lme4.
> m_mmrm_lmer_rlmer <- lme4::lmer(formula = frm, data = data, control = lmerControl(check.nobs.vs.nRE = "warning"), weights = getME(m_mmrm_rlmer, "w_e"))
> # lme4 with weights obtained from robustlmm
> coef(summary(m_mmrm_lmer_rlmer))
Estimate Std. Error t value
(Intercept) -0.0006501447 0.026083449 -0.02492557
TimepointWeek 16 0.0159041971 0.008826363 1.80189703
TimepointWeek 56 0.0116748979 0.009936630 1.17493532
> # robustlmm itself
> coef(summary(m_mmrm_rlmer))
Estimate Std. Error t value
(Intercept) -0.0004034906 0.011593143 -0.03480424
TimepointWeek 16 0.0128727105 0.009246332 1.39219640
TimepointWeek 56 0.0116329400 0.010408877 1.11759804
> # lme4 itself
> coef(summary(m_mmrm_lmer))
Estimate Std. Error t value
(Intercept) -7.504107e-05 0.026180493 -0.002866297
TimepointWeek 16 1.640811e-02 0.009221525 1.779327325
TimepointWeek 56 1.052267e-02 0.010355144 1.016177970
AND
> emmeans(m_mmrm_lmer_rlmer, specs = ~Timepoint, mode = "satterthwaite")
Timepoint emmean SE df lower.CL upper.CL
Week 8 -0.00065 0.0261 45.6 -0.0532 0.0519
Week 16 0.01525 0.0261 45.8 -0.0373 0.0678
Week 56 0.01102 0.0266 48.9 -0.0424 0.0644
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
> emmeans(m_mmrm_lmer, specs = ~Timepoint, mode = "satterthwaite")
Timepoint emmean SE df lower.CL upper.CL
Week 8 -0.000075 0.0262 45.9 -0.0528 0.0526
Week 16 0.016333 0.0262 45.9 -0.0364 0.0690
Week 56 0.010448 0.0267 49.3 -0.0432 0.0641
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
> emmeans(m_mmrm_rlmer, specs = ~Timepoint)
Timepoint emmean SE df asymp.LCL asymp.UCL
Week 8 -0.000403 0.0116 Inf -0.0231 0.0223
Week 16 0.012469 0.0116 Inf -0.0103 0.0352
Week 56 0.011229 0.0127 Inf -0.0136 0.0361
Confidence level used: 0.95
Weights: [plot of the robustness weights omitted]
It costs me a double fit, but that's not a big issue. If we assume that these weights are the final product of interest, putting them into lme4 makes sense, and I get all the "pleasures and benefits" of this well-featured package.
The effect is weaker for lmer + weights, although visible; weighting helped. [plot omitted]
For the robust lmer it was more visible: [plot omitted]
Thank you for all the help! I am going to end this discussion, but I just wanted to share the final results in case anyone needs them in the future.
Let me illustrate the question with robust lme4 (robustlmm) and emmeans.
Below I fit the model, and it tells me that there are 91 observations and 44 clusters. Unfortunately, it won't tell me the number of degrees of freedom. Therefore, emmeans assumes an infinite number, using the normal approximation to the t distribution.
In this very case, one could take various common approaches; a natural one is to borrow the Satterthwaite dfs from a parallel non-robust fit.
And some people actually did this already: Geniole, S. N., Proietti, V., Bird, B. M., Ortiz, T. L., Bonin, P. L., Goldfarb, B., Watson, N. V., and Carré, J. M. (2019). Testosterone reduces the threat premium in competitive resource division. Proceedings of the Royal Society B, 286. http://doi.org/10.1098/rspb.2019.0720
And it seems a quite reasonable approach (better than residual df or infinity). But the Satterthwaite dfs can vary from group to group, while the "df" argument of emmeans takes only a single (the first) value. Would it be possible to allow providing a vector of dfs?
Let's quickly demonstrate it on some data:
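(A sketch of such data; the real dataset isn't shown here, so this simulated stand-in only mimics its shape: 91 observations from 44 patients at three timepoints, with a hypothetical response y.)
library(lme4)
set.seed(42)
data <- data.frame(
    PatientId = factor(sample(1:44, 91, replace = TRUE)),
    Timepoint = factor(sample(c("Week 8", "Week 16", "Week 56"), 91, replace = TRUE),
                       levels = c("Week 8", "Week 16", "Week 56")),
    y         = rnorm(91)
)
frm <- y ~ Timepoint + (1 | PatientId)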
then fit the robust model:
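(A sketch matching the summary shown earlier; the control argument is passed through to the underlying lme4 machinery.)
library(robustlmm)
m_mmrm <- rlmer(frm, data = data,
                control = lmerControl(check.nobs.vs.nRE = "warning"))
summary(m_mmrm)   # note: no df column and no p-values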
OK, now let's employ lme4 (via lmerTest, to get the Satterthwaite dfs):
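(A sketch; loading lmerTest makes the summary carry per-coefficient Satterthwaite dfs.)
library(lmerTest)
m_mmrm1 <- lmer(frm, data = data,
                control = lmerControl(check.nobs.vs.nRE = "warning"))
coef(summary(m_mmrm1))   # Estimate, SE, df (Satterthwaite), t value, p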
but when I want to reuse the dfs, only the first value is taken:
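(A sketch of the call; the supplied vector is silently truncated to its first element.)
library(emmeans)
## only coef(summary(m_mmrm1))[1, "df"] actually takes effect here
emmeans(m_mmrm, specs = ~ Timepoint,
        df = coef(summary(m_mmrm1))[, "df"])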