rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
340 stars 30 forks source link

MVT-adjusted estimates "jump" even after assigned to a variable #490

Closed adrianolszewski closed 1 month ago

adrianolszewski commented 1 month ago

Hello, I assume the described issue comes from the fact that EM-mean is an object with methods performing some operations multiple times whenever the result is called.

I just realized that despite of setting the seed before the adjustment (Monte Carlo integration in mvtnorm), the results are still inconsistent between my printout from R console and the graph where I used it to display p-values.

I just "called" the name of the resulting variable multiple times in different places after it was assigned - and... it jumped. This costed me a lot of analyses to be repeated with putting the emmeans result explicitly into a data.frame... I would like to warn others, especially if they have tens of analyses to be reported.

I understand the behaviour, but it's really easy to forget about it. I mean - not about the fact that mvt needs the seed to be set. but that despite that it can jump anyway.

> set.seed(1000)
> (mmrm_contr <- update(contrast(mmrm_em,
+                                list(                        # W3    W6    M3    M6     M12    M20
+                                  "Month 6  : SoC vs. HD" = c(0,0,   0,0,  0,0, -1,1,   0,0,   0,0),
+                                  "Month 12 : SoC vs. HD" = c(0,0,   0,0,  0,0,  0,0,  -1,1,   0,0),
+                                  "Month 20 : SoC vs. HD" = c(0,0,   0,0,  0,0,  0,0,   0,0,  -1,1)
+                                )),
+                       adjust="mvt", level = 0.95, infer = c(TRUE, TRUE)))
 contrast              estimate   SE  df lower.CL upper.CL t.ratio p.value
 Month 6  : SoC vs. HD     4.73 4.24 107    -5.32     14.8   1.115  0.5326
 Month 12 : SoC vs. HD    10.07 5.19 107    -2.24     22.4   1.940  0.1326
 Month 20 : SoC vs. HD    17.61 4.38 107     7.24     28.0   4.024  0.0004

Confidence level used: 0.95 
Conf-level adjustment: mvt method for 3 estimates 
P value adjustment: mvt method for 3 tests 

and then I just call the result:

> mmrm_contr
 contrast              estimate   SE  df lower.CL upper.CL t.ratio p.value
 Month 6  : SoC vs. HD     4.73 4.24 107    -5.33     14.8   1.115  0.5326
 Month 12 : SoC vs. HD    10.07 5.19 107    -2.25     22.4   1.940  0.1325
 Month 20 : SoC vs. HD    17.61 4.38 107     7.23     28.0   4.024  0.0003

Confidence level used: 0.95 
Conf-level adjustment: mvt method for 3 estimates 
P value adjustment: mvt method for 3 tests 

OK, the difference is just little, but if this was something on the boundary of rounding, the issue would be much more visible.

Is there maybe any way to "lock" the result? Or is the only way to avoid that casting it into a data.frame explicitly or setting the seed any time I refer to this variable?

> set.seed(1000);
> mmrm_contr

 contrast              estimate   SE  df lower.CL upper.CL t.ratio p.value
 Month 6  : SoC vs. HD     4.73 4.24 107    -5.32     14.8   1.115  0.5326    #OK
 Month 12 : SoC vs. HD    10.07 5.19 107    -2.24     22.4   1.940  0.1326  #OK
 Month 20 : SoC vs. HD    17.61 4.38 107     7.24     28.0   4.024  0.0004  #OK

Confidence level used: 0.95 
Conf-level adjustment: mvt method for 3 estimates 
P value adjustment: mvt method for 3 tests 

Is this very behaviour mentioned explicitly somewhere in the documentation? If not, I think it really should.

Or - maybe you could add the "seed" parameter to emmeans, so it's always honoured every time "mvt" is requested without the need to set it "externally" not only every time one calls the update/contrast, but also calls the assigned variable too?

Argument ‘seed’ was ignored. Valid choices are:
adjust, alpha, avgd.over, bias.adjust, by.vars, calc, cross.adjust, delta, df, initMesg, estName, estType, famSize, frequentist, infer, inv.lbl, level, methDesc, nesting, null, predict.type, pri.vars, side, sigma, tran, tran.mult, tran.offset, tran2, type, is.new.rg, submodel, model.info, roles, grid, levels, matlevs, linfct, bhat, nbasis, V, dffun, dfargs, misc, post.beta
rvlenth commented 1 month ago

It is important to understand the distinction between the object and what you see when you display it. The famous quote of Rene Magritte suits this situation: "This is not a pipe" (it is a picture of a pipe).

The object created by emmeans, contrast, and several other functions is of class emmGrid (I think this is what you meant, not EM-mean). These objects have the information needed to compute estimates, but not the estimates themselves. When you display an emmGrid object, you can see what happens by looking at its print method:

> emmeans:::print.emmGrid
function(x, ..., export = FALSE)
    print(summary.emmGrid(x, ...), export = export)
<bytecode: 0x000002393f8ce488>
<environment: namespace:emmeans>

Note that when you print the object, it calls summary(). And if you print it again, it calls summary() again. It is summary.emmGrid that computes the estimates, adjusted P values, etc. Since the mvt method uses a randomized algorithm, we get different results each time summary() is called.

If you do something like this:

emm <- emmeans(..., adjust = "mvt")
semm <- summary(emm)
emm   # line 3
emm
emm
semm # line 6
semm
semm

you will get slightly different results from lines 3, 4, and 5, but the same results from lines 6, 7, and 8. That's because semm is of class summary_emm which is an annotated data frame. It has the computed estimates, adjusted P values, etc.

The other side of this coin is that you can do, say, contrast(emm, "pairwise") and get results, but contrast(semm, "pairwise") will give you an error because semm is just a data frame.

There is a lot of documentation in the emmeans package; Honestly, so much it is overwhelmong. Look at ? "emmGrid-class" for direct documentation of the class. There is also information in vignettes, e.g., a section in the "basics" vignette that explains these objects.

It would be inappropriate to add a seed argument to emmeans because that function has nothing to do with P-value adjustments. It's possible to add such an argument to summary.emmGrid, but I think that adds unneeded complexity. All you need is to know that the results of summary.emmGrid() are immutable. Also, ideally, from a computing standpoint, it is undesirable to repeatedly recompute the same quantities, as will happen when you repeatedly display an emmGrid object even if there is no "mvt" adjustment in force.

rvlenth commented 1 month ago

You might also want to look at the quick-start vignette, and in particular this section which covers the same ground. This is a recent addition to that vignette (before this question but after the update to CRAN version 1.10.1). It is also from the new site created by pkgdown, which is a little easier to navigate than the CRAN documentation.

adrianolszewski commented 1 month ago

Thank you very much for all these explanations! I can close it now.