How to reproduce behavior of adjdfe=source in SAS lsmeans or slice statement

qdread commented 1 year ago

Hi Russ, thanks as usual for the great package and how well you support and maintain it.

I am working on a proof of concept for a stats lesson where I demonstrate how to use emmeans. The lesson is targeted at SAS users, so I had the idea of trying to generate identical results in SAS and R.

Initially I thought there was a discrepancy in the way the Sidak adjustment is done in emmeans::contrast() as opposed to the lsmeans or slice statement of proc glimmix. I was finally able to replicate the behavior of emmeans::contrast() when I set the option adjdfe=row in lsmeans statement of proc glimmix. This led me to read in the SAS documentation that the default for lsmeans and slice statements is adjdfe=source, viz.

ADJDFE=ROW ADJDFE=SOURCE specifies how denominator degrees of freedom are determined when p-values and confidence limits are adjusted for multiple comparisons with the ADJUST= option. When you do not specify the ADJDFE= option or when you specify ADJDFE=SOURCE, the denominator degrees of freedom for multiplicity-adjusted results are the denominator degrees of freedom for the LS-mean effect in the "Type III Tests of Fixed Effects" table. When you specify ADJDFE=ROW, the denominator degrees of freedom for multiplicity-adjusted results correspond to the degrees of freedom that are displayed in the DF column of the "Differences of Least Squares Means" table.

I was curious if there is some way to replicate the adjdfe=source behavior in the emmeans package, or if you have any interesting insight or recommendation as to why you have a different default than glimmix.

reproducible emmeans example

library(emmeans)
library(agridat)

fit <- lmer(yield ~ calcium * fert * soil + (1|rep) + (1|fert:rep) + (1|soil:rep), data = cox.stripsplit)
emm <- emmeans(fit, ~ fert | calcium)
contrast(emm, 'pairwise', adjust = 'sidak')

corresponding SAS code

Below, the first slice statement reproduces the p-values of contrast() but the second doesn't.

data cox;
input rep $ soil $ fert $ calcium $ yield;
datalines;
R1 S1 F0 C0 4.91
R1 S1 F0 C1 4.63
R1 S1 F1 C0 4.76
R1 S1 F1 C1 5.04
R1 S1 F2 C0 5.38
R1 S1 F2 C1 6.21
R1 S1 F3 C0 5.6
R1 S1 F3 C1 5.08
R1 S2 F0 C0 4.94
R1 S2 F0 C1 3.98
R1 S2 F1 C0 4.64
R1 S2 F1 C1 5.26
R1 S2 F2 C0 5.28
R1 S2 F2 C1 5.01
R1 S2 F3 C0 5.45
R1 S2 F3 C1 5.62
R1 S3 F0 C0 5.2
R1 S3 F0 C1 4.45
R1 S3 F1 C0 5.05
R1 S3 F1 C1 5.03
R1 S3 F2 C0 5.01
R1 S3 F2 C1 4.63
R1 S3 F3 C0 5.8
R1 S3 F3 C1 5.9
R2 S1 F0 C0 6
R2 S1 F0 C1 5.39
R2 S1 F1 C0 4.95
R2 S1 F1 C1 5.39
R2 S1 F2 C0 6.18
R2 S1 F2 C1 5.94
R2 S1 F3 C0 6.58
R2 S1 F3 C1 6.25
R2 S2 F0 C0 5.86
R2 S2 F0 C1 5.41
R2 S2 F1 C0 5.54
R2 S2 F1 C1 5.41
R2 S2 F2 C0 5.28
R2 S2 F2 C1 6.67
R2 S2 F3 C0 6.65
R2 S2 F3 C1 5.94
R2 S3 F0 C0 5.45
R2 S3 F0 C1 5.12
R2 S3 F1 C0 4.73
R2 S3 F1 C1 4.62
R2 S3 F2 C0 5.06
R2 S3 F2 C1 5.75
R2 S3 F3 C0 6.39
R2 S3 F3 C1 5.62
R3 S1 F0 C0 4.96
R3 S1 F0 C1 5.63
R3 S1 F1 C0 5.47
R3 S1 F1 C1 5.31
R3 S1 F2 C0 6.18
R3 S1 F2 C1 6.31
R3 S1 F3 C0 5.95
R3 S1 F3 C1 6.14
R3 S2 F0 C0 5.71
R3 S2 F0 C1 5.37
R3 S2 F1 C0 6.21
R3 S2 F1 C1 5.83
R3 S2 F2 C0 6.28
R3 S2 F2 C1 6.55
R3 S2 F3 C0 6.39
R3 S2 F3 C1 5.57
R3 S3 F0 C0 4.6
R3 S3 F0 C1 4.9
R3 S3 F1 C0 4.88
R3 S3 F1 C1 4.73
R3 S3 F2 C0 5.89
R3 S3 F2 C1 6.2
R3 S3 F3 C0 5.68
R3 S3 F3 C1 5.72
R4 S1 F0 C0 5.79
R4 S1 F0 C1 5.33
R4 S1 F1 C0 5.13
R4 S1 F1 C1 5.18
R4 S1 F2 C0 5.86
R4 S1 F2 C1 5.98
R4 S1 F3 C0 5.55
R4 S1 F3 C1 4.32
R4 S2 F0 C0 5.61
R4 S2 F0 C1 5.15
R4 S2 F1 C0 4.82
R4 S2 F1 C1 5.06
R4 S2 F2 C0 5.67
R4 S2 F2 C1 5.54
R4 S2 F3 C0 5.19
R4 S2 F3 C1 4.46
R4 S3 F0 C0 5.13
R4 S3 F0 C1 4.9
R4 S3 F1 C0 4.88
R4 S3 F1 C1 5.18
R4 S3 F2 C0 5.45
R4 S3 F2 C1 5.8
R4 S3 F3 C0 5.12
R4 S3 F3 C1 4.42
run;

proc glimmix data = cox;
    class rep soil fert calcium;
    model yield = soil|fert|calcium / ddfm=kr;
    random rep fert(rep) soil(rep);
    slice fert*calcium / sliceby=calcium diff cl adjust=sidak adjdfe=row;
    slice fert*calcium / sliceby=calcium diff cl adjust=sidak;
run;

rvlenth commented 1 year ago

It took me a while to understand why one would have a d.f. option connected specifically with adjustments to P values. Then I realized that if you have an adjustment like, say, Tukey, then we are talking about the distribution of the maximum of a Studentized range and the Studentization is technically based on one SD estimate, so there would be an argument for specifying just one value for df. The built-in pstudent() function in R does not enforce this requirement in the way it vectorizes the function, as we can see by comparing these results:

> ptukey(1:5, 4, df = 3:7)   # like adjdfe = row
[1] 0.1113336 0.4471334 0.7363707 0.8943063 0.9625459

> ptukey(1:5, 4, df = 5)  # like adjdfe = source
[1] 0.1097220 0.4575163 0.7363707 0.8779368 0.9416749

... where only the 3rd values match.

Such a justification for using a single df value does not exist for the Sidak adjustment (nor Bonferroni, nor any of the p.adjust.methods, because these adjustments are based on the vector of unadjusted P values; and it seems wrong to me to distort those unadjusted P values when the df are not all the same. So I think SAS has the wrong default.

I am presently not inclined to implement an adjdfe option. My reasons are

it is not always appropriate (as just described for Bonferroni, Sidak, ...)
the output from summary includes a df column that pretty strongly suggests that each P value as based in part on the d.f. shown.
things are already complicated enough without creating yet another option
the user can always manually specify the desired d.f.
even with adjdfe = source, we are unlikely to duplicate what SAS does, because the df for the joint tests won't be the same (I admit this is a weak reason because it's probably possible to remedy that).

I don't have very easy access to SAS anymore, but I think maybe the results SAS would produce with its default adjdfe = source are

> pairs(emm, adj = "sidak", df = 9)
calcium = C0:
 contrast estimate    SE df t.ratio p.value
 F0 - F1     0.258 0.255  9   1.014  0.9150
 F0 - F2    -0.280 0.255  9  -1.099  0.8825
 F0 - F3    -0.516 0.255  9  -2.025  0.3675
 F1 - F2    -0.538 0.255  9  -2.113  0.3263
 F1 - F3    -0.774 0.255  9  -3.039  0.0813
 F2 - F3    -0.236 0.255  9  -0.926  0.9425

calcium = C1:
 contrast estimate    SE df t.ratio p.value
 F0 - F1    -0.148 0.255  9  -0.582  0.9941
 F0 - F2    -0.861 0.255  9  -3.380  0.0478
 F0 - F3    -0.398 0.255  9  -1.564  0.6289
 F1 - F2    -0.713 0.255  9  -2.797  0.1185
 F1 - F3    -0.250 0.255  9  -0.981  0.9260
 F2 - F3     0.463 0.255  9   1.816  0.4784

Results are averaged over the levels of: soil 
Degrees-of-freedom method: user-specified 
P value adjustment: sidak method for 6 tests

qdread commented 1 year ago

Thank you for your insightful response!

As it turns out, SAS's default adjdfe = source uses 54 as the denominator degrees of freedom. When I ran pairs(emm, adj='sidak', df=54) I got the same adjusted p-values as SAS produces. That is the denominator df from the Type III F-test for fertilizer calcium. But I agree with your intuition that 9 would be the more appropriate denominator df to use because we are making a pairwise comparison of the fertilizer means within each level of calcium, not comparing all levels of fertilizer calcium. So, maybe SAS is using an especially inappropriate default df for comparisons done within the slice statement.

I also agree with your arguments for why it isn't a good idea to implement the adjdfe option. I don't think it is necessary to perfectly reproduce the output of other statistical packages. I mainly asked this question out of curiosity about the different default behavior. I really appreciate your help.

rvlenth commented 1 year ago

Well, if SAS is using the df for fertilizer calcium, then it is dead wrong, because that is the d.f. for the family of interaction contrasts for that term. Interaction contrasts have coefficients that sum to zero and whose marginal sums also sum to zero. In this example, the pairwise comparisons of fertilizer calcium combinations (which are not interaction contrasts, they are contrasts of cell means) will have 54 d.f. only if they are on the same fertilizer, and 11.9 df otherwise. (Check it out: pairs(emm, by = NULL))

qdread commented 1 year ago

Thanks, that makes a lot of sense. I might draw the attention of the other USDA statisticians to this. Many of them are dyed in the wool SAS users! :-D

rvlenth commented 1 year ago

OK, but just to be clear, that is a common mistake, and SAS is a solid, reliable product that does a whole lot of stuff quite well.

Russ

rvlenth commented 1 year ago

Closing this issue as resolved

rvlenth / emmeans

How to reproduce behavior of adjdfe=source in SAS lsmeans or slice statement #431

reproducible emmeans example

corresponding SAS code