Closed R180 closed 5 years ago
The choice of which t test is used is most definitely NOT in the design of emmeans. The emmeans() function and its relatives simply use the model that is fitted to the data, and use an appropriate test for that model. If the model is fitted using, say, lm() with a univariate response vector, then that model is based on an assumption of homogeneous error variances, and that dictates that a pooled t test is used. If, on the other hand, the model is fitted using nlme::gls(), and a specification was made in gls() that the variances depend on treatments, then something akin to the Welch t tests will result in emmeans(). Thus, in order for jamovi to produce unequal-variances tests for post hoc comparisons, it is jamovi's responsibility to allow for some model that estimates non-homogeneous variances, and to pass that model to emmeans().
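A minimal sketch of the two fits being contrasted (the data frame dat and the variables y and group are hypothetical; note that the emmeans() call itself is identical in both cases):

```r
library(nlme)      # provides gls() and varIdent()
library(emmeans)

## Homogeneous-variance model: post hoc comparisons come out as pooled t tests
fit_lm <- lm(y ~ group, data = dat)
emmeans(fit_lm, pairwise ~ group)

## Heteroscedastic model: a separate variance is estimated per group,
## so the comparisons come out akin to Welch t tests
fit_gls <- gls(y ~ group, data = dat,
               weights = varIdent(form = ~ 1 | group))
emmeans(fit_gls, pairwise ~ group)
```

The difference lies entirely in the fitted model; emmeans() simply propagates each model's variance structure.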
For repeated-measures situations, social scientists (at least historically) seem to concentrate on the choice between the "univariate" and "multivariate" methods, depending on some assessment of sphericity or compound symmetry. The former typically involves pooled t tests. The latter could be performed using lm() but with a multivariate response (I think some part of the afex package uses that too), and that is supported by emmeans -- again, provided that is the model that is used.
A somewhat separate matter is degrees of freedom. While the t statistic may come out the same as the Welch t statistic, the model object may or may not support the needed Satterthwaite degrees-of-freedom calculation. Most don't.
I will try to alert Jonathon Love to this response and see if I can somehow add him as a participant.
hey russell,
The choice of which t test is used is most definitely NOT in the design of emmeans. The emmeans() function and its relatives simply use the model that is fitted to the data, and use an appropriate test for that model.
haha, i think we might be arguing semantics. i feel like these two sentences contradict each other :) but i get what you're saying.
so @R180 has observed that emmeans when fed a 'normal' ANOVA produces t stats similar/same as a student's t-test, but when emmeans is fed an RM ANOVA, it produces t stats similar/same as a welch's t-test.
@R180 felt this wasn't quite correct/consistent, so i suggested he check in with you.
is this all working as expected? or does something here suggest that jamovi isn't doing something correctly? (we're both social scientists [unfortunately?], so you may find it easier to explain using that vernacular).
with thanks
Well, I don't really know what jamovi does, so don't understand the contradiction. If the user chooses the model, then the user must choose an appropriate one. If jamovi chooses it, then perhaps there needs to be clearer communication about the impact of those choices. But to @R180, the main point is that emmeans() does not look at the data, it looks at the model.
Jonathon, FWIW, I am not a social scientist, and often have trouble understanding social scientists' needs. I have been known to lock horns with social scientists over issues like post hoc power calculations (I think they are a sham). I'm a retired statistics prof, and like physical science/engineering applications the best.
Well the thing is, I input data with wildly different variances that could not possibly fit the equal-variance ANOVA model, but still no Welch's test (I don't think). See below and attached.
I think that at a minimum, the output should indicate in a footnote what type of t test was used, and what model criterion value led to the use of that particular type. Example: [image: Wildly different variances.png]
Honestly, I can't help with that. It's up to whatever process leads to developing the model. Again, emmeans() does not analyze the data. If you have wildly different variances, then it is up to you to choose an appropriate model.
OK. So I'll assume that somewhere in jamovi, there is a model parameter value in the ANOVA routine (i.e., specification of Student's t test) that's set differently in the Repeated-Measures ANOVA routine (i.e., specification of Welch's t). Thanks.
hi @R180,
there is a practice in the social sciences, that you perform an ANOVA, and if it's significant, you do post-hoc t-tests (simply on the data) to determine where the difference(s) is/are. with this approach, i don't actually think it makes sense to do the ANOVA in the first place, seeing as you're using completely different models with the post-hoc t-tests. you should just do the t-tests.
we've been meaning to develop a pair-wise t-test module for jamovi for this other approach (which, we don't really think is a good approach, but hopefully can spur discussion).
the beauty of the emmeans approach is that the t-tests make use of the same model as the primary test.
OK. So I'll assume that somewhere in jamovi, there is a model parameter value in the ANOVA routine (i.e., specification of Student's t test) that's set differently in the Repeated-Measures ANOVA routine (i.e., specification of Welch's t).
no, i don't think this is true. i think the different t-values are simply emergent phenomena of the models.
i think if we set the RM Model up in such a way that it provided student's t-test type results, it would no longer be an RM ANOVA.
if you want to see the code, here we run the RM ANOVA:
https://github.com/jamovi/jmv/blob/5ed9f64d6ea903d1c9453a75847809d656524116/R/anovarm.b.R#L43
and here we call emmeans on the model:
https://github.com/jamovi/jmv/blob/5ed9f64d6ea903d1c9453a75847809d656524116/R/anovarm.b.R#L539
the t-values are determined by the model, not by a decision somewhere to use one type of t-test over another.
Ok. I think that such behavior is problematic. We've already established, I think, that the choice of the type of t test is wholly dependent on what model jamovi chooses to attempt to fit to the data, and wholly independent of the specific characteristics of the data. So for the package to be credible, I think there needs to be clear documentation of which model parameters govern the selection of the type of t test.
Or to put it slightly differently, the documentation needs to present a convincing rationale, and explanation, for the model's behavior. The model can be capable of selecting between types of t test only if it has been given the freedom to make that choice. So there needs to be a stated rationale for granting the model that freedom. Without such a rationale, the scientist cannot defend the model's choices, and so cannot defend his or her use of the model.
Hmmmm. Well, there is a practice in statistics, whereby one fits a model, then -- before doing any testing at all, much less post hoc comparisons -- one makes some serious efforts to determine whether the model reasonably fits the data. This process involves a lot of residual plots and such, and also perhaps some tests of particular model assumptions, such as some kind of test of variance homogeneity like Levene's test, or (in the case of repeated measures) sphericity. If one just forges ahead with a bunch of statistical tests that may or may not be appropriate, that's reckless. [Insert some random analogy with recent proceedings involving the US Government.]
Statistics is serious and difficult business, and a recipe approach is as sure to fail as using a sea-level recipe for chocolate cake when high in the Andes. Any reasonable user interface would ensure that users have the opportunity to assess the success of the fitted model in explaining the data and in meeting its underlying assumptions.
Additionally, while I do not personally use the afex package, I am pretty sure its aov model setups do try to perform some of these steps in their default output, and the returned object does include both the "univariate" and "multivariate" models so that the user may choose which one to use in post hoc analyses. To me, that is still too routinized, but it's better than blundering ahead without ever questioning the underlying model. In the case of jamovi, it is a serious design failure if the user does not have the opportunity to choose the model, or at least to be informed as to which one was chosen automatically.
the choice of the type of t test is wholly dependent on what model jamovi chose to attempt to fit to the data
yes. but it's worth noting that we just fit standard models. we haven't made any decisions here.
I think there needs to be clear documentation of which model parameters govern the selection of the type of t test.
so i suppose this is a question for @rvlenth. is there a ruleset a curious person could use to determine what type of t-test emmeans will use, russell? so there's the suggestion that with an RM ANOVA model emmeans uses a welch's t-test (i don't know if it actually does or not, i assumed it just looked like one, but let's say it is welch's t-test – at the risk of sounding/being ignorant, it's not immediately obvious to me from the model why a welch's would be used there [but i'm certainly not suggesting it's incorrect]).
i kinda imagined the choice of t-test (and corrections?) was all pretty complicated - and i've been happy to assume emmeans is doing the right thing.
with thanks
one makes some serious efforts to determine whether the model reasonably fits the data
sorry, we do provide all of that, i just took that being performed for granted.
You can't take it for granted if the post hoc results are displayed before the user has a chance to make such determinations.
You can't take it for granted if the post hoc results are displayed before the user has a chance to make such determinations.
yes, that's true. which is why we don't display the post hoc results straight away. but i feel like this is a bit of a tangent.
(or are you talking about the pairwise t-test module? in which case, we're agreed that's a bad idea).
Well in fact I don't know what I'm talking about, as I haven't used jamovi. So I should stay out of this and let you and Richard figure this out.
but returning to my earlier question, if i understand correctly, richard wants to better understand the choices emmeans makes wrt t-tests.
is there a ruleset a curious person could use to determine what type of t-test emmeans will use? so there's the suggestion that with an RM ANOVA model emmeans uses a welch's t-test (i don't know if it actually does or not, i assumed it just looked like one, but let's say it is welch's t-test – at the risk of sounding/being ignorant, it's not immediately obvious to me from the model why a welch's would be used there [but i'm certainly not suggesting it's incorrect]).
i kinda imagined the choice of t-test (and corrections?) was all pretty complicated - and i've been happy to assume emmeans is doing the right thing.
with thanks
I will say it yet again: EMMEANS() DOES NOT MAKE A CHOICE REGARDING WHICH T TEST TO USE. YOU ARE ABSOLUTELY INCORRECT TO ASSUME IT IS CHOOSING WHICH ONE TO USE.
It uses the one model that it is given, and it computes t tests based on that model. If you think it is receiving more than one model and is being asked to choose between them, you are mistaken and you need to recode jamovi accordingly. If it's an object from afex, I think Henrik's driver (not mine) for those models defaults to the univariate model and there is an optional argument for specifying the multivariate one. But you (or your user) have to specify that option.
It uses the one model that it is given, and it computes t tests based on that model
sorry, i think we're muddled by semantics. the word 'choice' or 'choose' might be muddling us.
i mean these two statements to be equivalent:
"the package runs a t-test based on the model" or "the package chooses a t-test based on the model"
but either way, we're interested in what properties of the model, lead to different t-tests being performed.
for example, there's the suggestion that an RM ANOVA leads to a welch's t-test being performed - i'm trying to understand why - because it doesn't seem obvious (or to determine that it's not actually a welch's t-test).
it sounds like we might need to ask henrik next ... but i can imagine him saying "t-tests? that's all emmeans, you'd have to ask russell"
anyhow, i hope you're not too exasperated by all of this. i do try very hard to communicate clearly, but there is the odd spectacular fail.
Yes, I am completely exasperated, because you are assuming that my software does things that it simply does not do.
This is not a matter of semantics. I do not think you are understanding what I am trying to tell you -- e.g., what is meant by a statistical model. One model leads to one form of t test. If you want another form of t test, you need a different model.
emmeans() does not make any choices; it does not choose; it does not make decisions about the best way to analyze data. It takes one model object and computes stuff based on the model formula, its regression coefficients, and the estimated covariance matrix of those regression coefficients. All that stuff is based on a set of assumptions defined in the model and hence implicitly chosen by the user when the model was fitted -- whether the error variance is homogeneous; whether things are correlated and how; etc. All that is in place before emmeans() comes into the picture. Given those coefficients, model formula, and covariance matrix, we decide on a bunch of linear functions of those regression coefficients, obtain estimates of those linear functions, and estimate their standard errors. Those results lead to t statistics defined as the estimates divided by their standard errors. Period. There is no step anywhere along the line where some other set of t statistics is computed based on another set of assumptions. There is only one model and only one set of assumptions upon which that model is based.
There is nothing anywhere in the documentation of the emmeans package that says that different assumptions are examined and used to decide on an appropriate form of a t statistic. I don't know where you got that idea.
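In other words, a hand-rolled sketch of where every t ratio comes from (fit stands for any fitted model object, and the contrast vector L is purely illustrative; emmeans() does this with far more generality and care):

```r
b <- coef(fit)                       # the model's regression coefficients
V <- vcov(fit)                       # their estimated covariance matrix
L <- c(0, 1, -1)                     # a linear function (contrast) of the coefficients
est <- sum(L * b)                    # estimate of that linear function
se  <- sqrt(drop(t(L) %*% V %*% L))  # its standard error, from the model's vcov
t_ratio <- est / se                  # the t statistic: estimate divided by SE
```

Everything here comes from the one fitted model; no second set of assumptions ever enters the computation.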
Perhaps I can ask a simpler question. Do we know what formula emmeans uses to calculate the degrees of freedom for a given post-hoc t test? Or is that formula computed by some other package on which emmeans depends?
-- Richard Anderson
It uses a method for d.f. depending on the model class. If it is lm, it uses $df.residual. If it is lmerMod, it uses a sophisticated algorithm in the pbkrtest package. If it is aovList, it uses Satterthwaite. And so forth. If you like, you can examine the code by listing the corresponding emm_basis method, e.g.,
emmeans:::emm_basis.lm
emmeans:::emm_basis.lmerMod
emmeans:::emm_basis.aovList
### etc
And I remind you that the model class is determined by the R code you used to fit the model. For example
warp.lm = lm(breaks ~ wool * tension, data = warpbreaks)
class(warp.lm)
This is just another way of saying that emmeans() uses the model to get all its information -- including how to compute d.f.
I'll be more specific with that example...
> warp.lm = lm(breaks ~ wool * tension, data = warpbreaks)
> emmeans(warp.lm, pairwise ~ tension | wool)
$emmeans
wool = A:
tension emmean SE df lower.CL upper.CL
L 44.6 3.65 48 37.2 51.9
M 24.0 3.65 48 16.7 31.3
H 24.6 3.65 48 17.2 31.9
wool = B:
tension emmean SE df lower.CL upper.CL
L 28.2 3.65 48 20.9 35.6
M 28.8 3.65 48 21.4 36.1
H 18.8 3.65 48 11.4 26.1
Confidence level used: 0.95
$contrasts
wool = A:
contrast estimate SE df t.ratio p.value
L - M 20.556 5.16 48 3.986 0.0007
L - H 20.000 5.16 48 3.878 0.0009
M - H -0.556 5.16 48 -0.108 0.9936
wool = B:
contrast estimate SE df t.ratio p.value
L - M -0.556 5.16 48 -0.108 0.9936
L - H 9.444 5.16 48 1.831 0.1704
M - H 10.000 5.16 48 1.939 0.1389
P value adjustment: tukey method for comparing a family of 3 estimates
> library(nlme)
> warp.gls = gls(breaks ~ wool * tension, data = warpbreaks,
weights = varIdent(form = ~ 1 | wool*tension))
> emmeans(warp.gls, pairwise ~ tension | wool)
$emmeans
wool = A:
tension emmean SE df lower.CL upper.CL
L 44.6 6.03 48 32.4 56.7
M 24.0 2.89 48 18.2 29.8
H 24.6 3.42 48 17.7 31.4
wool = B:
tension emmean SE df lower.CL upper.CL
L 28.2 3.29 48 21.6 34.8
M 28.8 3.14 48 22.5 35.1
H 18.8 1.63 48 15.5 22.1
Confidence level used: 0.95
$contrasts
wool = A:
contrast estimate SE df t.ratio p.value
L - M 20.556 6.69 48 3.074 0.0096
L - H 20.000 6.94 48 2.883 0.0159
M - H -0.556 4.48 48 -0.124 0.9916
wool = B:
contrast estimate SE df t.ratio p.value
L - M -0.556 4.55 48 -0.122 0.9918
L - H 9.444 3.67 48 2.574 0.0346
M - H 10.000 3.54 48 2.824 0.0186
P value adjustment: tukey method for comparing a family of 3 estimates
These are essentially the Welch tests; however, gls doesn't use the Satterthwaite method to get the d.f.
At this point, I suggest you not use jamovi to do your analyses, as it apparently does not currently support the kind of models you need.
Thank you. This is actually quite illuminating and goes a long way toward addressing my initial question.
You indicated "These are essentially the Welch tests; however, gls doesn't use the Satterthwaite method to get the d.f." The Satterthwaite approach, which I believe is another name for Welch's, does typically use a unique method to calculate degrees of freedom -- a method that can produce fractional rather than integer-valued degrees of freedom, and that can produce d.f. values that are responsive to the variances in the cells being tested. So instead of having d.f. = 48 for each contrast, d.f. could vary across the contrasts. I have been looking at output and assuming that the post-hoc tests were Student's t tests because the d.f. did not reflect the Welch-Satterthwaite method (i.e., were not responsive to variance inequality across cells and were never fractional).
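For reference, the Welch–Satterthwaite approximation being described computes, for a comparison of two cells with sample variances $s_1^2, s_2^2$ and sizes $n_1, n_2$,

\[
\nu \;\approx\; \frac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
\]

Because this depends on the observed cell variances, the resulting d.f. are generally fractional and can differ from contrast to contrast.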
So the ANOVA does not use the Satterthwaite method to get the d.f. However, repeated-measures ANOVA does use the Satterthwaite method to get the d.f. And so repeated-measures post hocs are responsive to unequal cell variances, are sometimes fractional, and so sometimes produce a different d.f. value despite n being constant.
I'm starting to think that these inconsistencies (e.g., the fact that the gls routine doesn't use the Satterthwaite d.f., while some other routine, perhaps written by someone else, for repeated-measures post-hocs does use Satterthwaite d.f.) may be part of what one accepts when one uses R packages. Each package author may make choices that aren't wrong, but that also may not be consistent with choices made by other authors. And when those various packages are brought together inside an overarching package, the inconsistencies become more noticeable. Is this what's probably happening? Or am I off base here? @rvlenth @jonathon-love
There is indeed a lot of diversity in how R packages are implemented. Some are more carefully written than others. Some reflect the biases of their developers and may pass over certain areas that are important to others. Many serve a very general class of situations, so that specific methods such as Welch's t test become just a special case, for which details such as degrees-of-freedom methods kind of go by the wayside.
This is actually true of any open-source product that accepts contributed code. Much of what you get is on the bleeding edge of new developments -- but it can be your blood!
PS but don't lose sight of what's important. The d.f. are not as important as having t ratios that are correctly made. Once the d.f. exceed 15 or so, it hardly makes a difference; but the wrong t value makes a big difference.
FWIW, I have added to the gls support a crude Satterthwaite method:
> emmeans(warp.gls, pairwise ~ tension | wool, mode = "sat")
$emmeans
wool = A:
tension emmean SE df lower.CL upper.CL
L 44.6 6.03 7.97 30.6 58.5
M 24.0 2.89 7.93 17.3 30.7
H 24.6 3.42 7.86 16.6 32.5
wool = B:
tension emmean SE df lower.CL upper.CL
L 28.2 3.29 8.05 20.7 35.8
M 28.8 3.14 7.99 21.5 36.0
H 18.8 1.63 8.23 15.0 22.5
d.f. method: satterthwaite (via simulation)
Confidence level used: 0.95
$contrasts
wool = A:
contrast estimate SE df t.ratio p.value
L - M 20.556 6.69 11.4 3.074 0.0255
L - H 20.000 6.94 12.5 2.883 0.0331
M - H -0.556 4.48 15.3 -0.124 0.9916
wool = B:
contrast estimate SE df t.ratio p.value
L - M -0.556 4.55 15.9 -0.122 0.9918
L - H 9.444 3.67 11.8 2.574 0.0594
M - H 10.000 3.54 12.0 2.824 0.0379
P value adjustment: tukey method for comparing a family of 3 estimates
In general, Satterthwaite d.f. requires finding the variance of the variance estimates. I do this by simulation, perturbing the response values and obtaining a gradient in the estimated covariance matrix. So the result varies a bit from one call to another. In the above, each group has 9 observations, so each mean should have 8 d.f., and the results shown are pretty close. For the differences, we can compare with what we get by other means; for example, comparing (wool A, tension L) with (wool A, tension M), we have
> with(warpbreaks, t.test(breaks[1:9], breaks[10:18]))
Welch Two Sample t-test
data: breaks[1:9] and breaks[10:18]
t = 3.0736, df = 11.481, p-value = 0.01011
and the output shows 11.4 d.f. (the P values differ because of the (approximate) Tukey adjustment).
This additional feature will be in the next emmeans update. Note that it applies only to gls, not to all models.
ah! russell this is the answer i was fishing for! i just didn't know how to ask it:
Given those coefficients, model formula, and covariance matrix, we decide on a bunch of linear functions of those regression coefficients, obtain estimates of those linear functions, and estimate their standard errors. Those results lead to t statistics defined as the estimates divided by their standard errors.
so returning to richard's remark:
I'm starting to think that these inconsistencies (e.g., the fact that the gls routine doesn't use the Satterthwaite d.f., while some other routine, perhaps written by someone else, for repeated-measures post-hocs does use Satterthwaite d.f.) may be part of what one accepts when one uses R packages.
but shouldn't the t stats for a model, i.e. len ~ dose + supp, be identical irrespective of what R package was used? and if they are different, it should be possible to pin-point where a package isn't doing the correct thing (i.e. gls doesn't implement Satterthwaite, for example)?
it's been an assumption of mine with emmeans that for any given model formula, there's one correct set of post-hocs – irrespective of what R package was used to construct the model object.
am i correct on this russell? (or does each package construct its own post-hoc strategy, and put that in the model object?)
with thanks
I give up. I wrote the code and I know what it does and doesn’t do.
Jonathon, I very much appreciate what you have done to help me understand and rein-in some package-dependencies issues. But in this thread, you have fished for a misinterpretation of one verb in my latest extended answer. I have run out of ways to explain it. So I think I have to just let you continue to believe what you insist on believing, regardless of my protests. It is too bad but there is no other option.
Perhaps you have a statistics department (as opposed to a social science department) near you with a colleague who can understand the way I think, and who can sit down with you and explain what I’ve been trying to say. A model is more than its formula; it is also the underlying assumptions upon which it is based.
BTW, in social-science terms, what we are seeing here is a blazing example of confirmation bias.
Sorry for adding something more here after it is closed. But as the author of afex, which is the package that I believe jamovi still uses in the backend for ANOVAs, I feel I can try to untangle some of the discussion here.
The question as I understand it is: what are the degrees of freedom for repeated-measures models based on the aov model? Because currently, afex still passes the aov model per default to emmeans in case there are repeated-measures factors. And it is also not fully clear to me how the degrees of freedom are obtained in that case. I can currently only say that they are based on the aov model and they estimate different variances as shown below.
Let us use the example data set described in more detail here. It has both between-subjects and within-subjects factors.
library("afex")
library("emmeans")
data(sk2011.1)
a1 <- aov_ez("id", "response", sk2011.1, between = "instruction",
within = c("inference", "plausibility"))
For the between-subjects factors, the df are easy to understand, N - k (here, 40 - 2 = 38):
emmeans(a1, "instruction")
# NOTE: Results may be misleading due to involvement in interactions
# instruction emmean SE df lower.CL upper.CL
# deductive 77.7 3.56 38 70.5 85.0
# probabilistic 80.5 3.56 38 73.3 87.7
#
# Results are averaged over the levels of: inference, plausibility
# Confidence level used: 0.95
length(unique(sk2011.1$id))
# [1] 40
For repeated-measures factors, the df are not that clear. For example:
emmeans(a1, "inference")
# NOTE: Results may be misleading due to involvement in interactions
# inference emmean SE df lower.CL upper.CL
# MP 87.5 3.78 127 80.0 95.0
# MT 76.7 3.78 127 69.2 84.2
# AC 69.4 3.78 127 61.9 76.9
# DA 83.0 3.78 127 75.5 90.4
#
# Results are averaged over the levels of: instruction, plausibility
# Confidence level used: 0.95
For the other factor:
emmeans(a1, "plausibility")
# NOTE: Results may be misleading due to involvement in interactions
# plausibility emmean SE df lower.CL upper.CL
# plausible 86.2 2.79 54.7 80.6 91.8
# implausible 72.1 2.79 54.7 66.5 77.7
#
# Results are averaged over the levels of: instruction, inference
# Confidence level used: 0.95
As a reminder, they are based on the aov part of the model object, which shows the following summary:
summary(a1$aov)
# Error: id
# Df Sum Sq Mean Sq F value Pr(>F)
# instruction 1 622 621.6 0.307 0.583
# Residuals 38 77042 2027.4
#
# Error: id:inference
# Df Sum Sq Mean Sq F value Pr(>F)
# inference 3 14827 4942 5.809 0.000990 ***
# instruction:inference 3 15308 5103 5.998 0.000785 ***
# Residuals 114 96988 851
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Error: id:plausibility
# Df Sum Sq Mean Sq F value Pr(>F)
# plausibility 1 16046 16046 34.23 9.13e-07 ***
# instruction:plausibility 1 5001 5001 10.67 0.00232 **
# Residuals 38 17815 469
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Error: id:inference:plausibility
# Df Sum Sq Mean Sq F value Pr(>F)
# inference:plausibility 3 2096 698.6 2.867 0.03970 *
# instruction:inference:plausibility 3 2911 970.3 3.982 0.00971 **
# Residuals 114 27779 243.7
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows that this model estimates individual variances per error stratum. How exactly the df are calculated from this is not something I fully understand (even after looking at the source code). But maybe you can give a high-level overview of this, Russ?
One final word: I agree that it is somewhat unfortunate that afex uses the aov model here. But when I initially programmed it, that was the only way to use emmeans for repeated-measures ANOVA. For some time now, we can also use the multivariate model for that (the lm, or better, mlm model). This can be changed globally via an option: afex_options("emmeans_model"). The default is "univariate", but "multivariate" is now also supported. Note we can also set the model directly in the call to emmeans as well. If we do so, we get the 'familiar' df.
emmeans(a1, "inference", model = "multivariate")
# inference emmean SE df lower.CL upper.CL
# MP 87.5 1.80 38 83.9 91.2
# MT 76.7 4.06 38 68.5 84.9
# AC 69.4 4.77 38 59.8 79.1
# DA 83.0 3.84 38 75.2 90.7
#
# Results are averaged over the levels of: instruction, plausibility
# Confidence level used: 0.95
I plan to change the default in the future from the aov model to the lm model. But when is the right time for such a change that breaks backward compatibility? Probably not in my first year at a new faculty job.
sorry russell,
this has been hard for me too. every time you've characterised what i think, my thoughts are "that isn't what i think", and every time you describe what is the case, my thoughts are "that is (more or less) what i think".
i feel like you've seized on the word 'choose' as having a very specific meaning, when i mean it in a very general sense ... and i was hoping by pointing out that you'd used the word 'decide' that you would see we're actually on the same page here.
you seem bewildered that i can think such foolish things, and i'm a bit bewildered that you can think that i think such foolish things. so it seems like the best move is to leave this discussion.
anyhow, i have appreciated you taking the time, even when you were feeling frustrated.
jonathon
Henrik's remarks bring up a whole different dimension to this whole thing, having to do with mixed models and a different manifestation of the Satterthwaite df that has to do with combining errors from different sources (error strata, where at least in aov objects, the errors within each stratum are assumed to have constant variance), as opposed to what happens in the Welch procedure, where there is only one source of error but its variance isn't homogeneous. To try to respond to all that, I'd have to write out notes for a semester's course in linear models, and in doing so, I'd dig an even bigger hole for myself.
The challenge faced by a software developer is that, by making sophisticated methods available, there is an implicit assumption that the user understands the methods being offered and is responsible for ensuring that s/he is using it appropriately. But of course, there will always be users who don't understand what is appropriate and consequently will misuse the software -- perhaps resulting in an analysis that is markedly worse than some simpler analysis they could have done more or less correctly. Then there is the challenge of simplified user interfaces aimed at less experienced users, where we are also trusting that the design of the interface keeps users out of trouble and that the user interface correctly represents what is being offered -- but may not always fulfill that need either. And, let's admit it, people like me are part of the problem too. Many professional statisticians are a lot more interested in math than in data and scientific reasoning, and sometimes lead people down a very misleading path. So you can't trust us either. I could claim myself as an exception, but I'd be wrong in doing so because I have my own set of blind spots, and I am blind to my own blind spots.
I guess what maybe I'll try to do is add something to the FAQs vignette and perhaps another illustrative example somewhere else. (Or perhaps a whole new vignette on repeated measures? No, that'd just be asking for trouble.) But I'm not going to comment further here.
OK, I'll say one more thing. In response to Jonathon's question:
... does each package construct its own post-hoc strategy, and put that in the model object?
YES. Each package potentially implies a different post hoc strategy. You choose the package and modeling options that do the best job of fitting the data. Then you give that result to emmeans() and it will do appropriate post hoc tests.
thanks russell,
i see that's where i was incorrect. i'd assumed that for a given model formula, irrespective of package used to construct the model, i'd get the same post hocs. i see now the situation is more complicated than that.
with thanks
Greetings,
I recently had an exchange with Jonathon Love (developer of jamovi statistical software) about some behavior of the emmeans package, which jamovi makes use of. I submitted a request to Jonathon that jamovi enforce some consistency regarding the use of Student's t versus some alternative-to-Student's (Welch's?) t tests in post-hoc comparisons. Through trial and error, I've noticed that in a non-repeated-measures ANOVA, the degrees of freedom for the post-hoc tests do not change no matter how unequal the within-cell variances are. So I assume they are based on Student's t. But for repeated-measures ANOVA, the degrees of freedom for each post-hoc test do depend on the degree of inequality of variances. So I presume these are Welch's (or some other non-Student's) t tests. Jonathon said that this difference in the type of t test employed is caused by the design of the emmeans package (not jamovi) and so the issue should be directed to the developer(s) of the emmeans package. So my question is: what t-test types does emmeans employ, why is there an apparent inconsistency in the type of post-hoc t test employed, and is there a way to fix the inconsistency?