statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License
9.94k stars 2.87k forks source link

Repeated measures ANOVA #3262

Open TheChymera opened 7 years ago

TheChymera commented 7 years ago

I wanted to verify whether among different measurements sessions, any are significantly different from the others. One way to do this would be to use a repeated-measures ANOVA, but I see that the ANOVA implementation in statsmodels does not support this.

I thought another way would be to use MixedLM, which can deal with repeated measures, and then run an ANOVA on that model. I tried two approaches, using this data:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("~/MixedLM_data.csv")
lm = smf.mixedlm("t ~ session", df, groups=df["subject"]).fit()
anova = sm.stats.anova_lm(lm)

This fails with AttributeError: 'MixedLMResults' object has no attribute 'ssr'.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("~/MixedLM_data.csv")
lm = smf.mixedlm("t ~ session", df, groups=df["subject"]).fit()
anova = lm.f_test([[0,1,-1,0,0,0],[0,1,0,-1,0,0],[0,1,0,0,-1,0],[0,1,0,0,0,-1]])
print(anova)

This succeeds, but I am unsure whether I understand the .f_test() syntax correctly. I would like to test the hypothesis that all sessions are equal. I notice that an array of length 6 is required, though I have only 5 sessions. How come?

Can you help me out @kshedden ?

josef-pkt commented 7 years ago

Once you have the model and parameter names, then you can specify the constraint/contrasts for the f_test or t_test also with strings, either a comma separated list of equality conditions or a list of strings "param1 = param2, param2 = param3" or ["param1 = param2", "param2 = param3"] where param1, param2 should be the names as they show up for example in params This is easier if there are just a few constraints, otherwise I usually use np.eye or np.diag or similar to build up a large constraint matrix with -1, +1 and zeros in each row.

params in MixedLM will or might have additional parameters for the random effects. The constraints in f_test, wald_test and t_test needs to have as many columns as the length of params and cov_params.

anova_lm is currently only implemented for OLS. I don't know whether mixedlm has a likelihood ratio test or a ssr based f_test equivalent.

TheChymera commented 7 years ago

Ok, so based on what you said, the following should test if all my regressors (except the random effects) are equal (can you confirm this?):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("~/MixedLM_data.csv")
lm = smf.mixedlm("t ~ session", df, groups=df["subject"]).fit()
print(lm.params)
anova = lm.f_test([[1,-1,0,0,0,0],[1,0,-1,0,0,0],[1,0,0,-1,0,0],[1,0,0,0,-1,0]])
print(anova)

This gives me:

Intercept          1.260392
session[T.post]   -0.027365
session[T.t1]     -0.593442
session[T.t2]     -1.243611
session[T.t3]     -0.174372
Intercept RE       0.000003
dtype: float64
<F test: F=array([[ 1.86775433]]), p=0.1556140289064265, df_denom=20, df_num=4>

This looks like it could be right, except that I'm worried about how the intercept will be compared to the other regressors. As I understand it the value of the intercept is with respect to zero, whereas the value of the other regressors is with respect to the intercept. Is f_test() aware of the fact that the intercept is special in this regard?

josef-pkt commented 7 years ago

f_test doesn't know anything about what the columns mean. There is no special handling of a constant column in this.

[[1,-1,0,0,0,0],[1,0,-1,0,0,0],[1,0,0,-1,0,0],[1,0,0,0,-1,0]] It tests that the differences between the first and the second, third and forth coefficient is zero. That would be correct if you have all levels of the categorical variable.

However, AFAICS you use dummy coding where the first level "t0" (?) is the reference level where the effect is captured by the intercept. In this case, the hypothesis that all levels have the same effect as the reference level would be that session[T.post] to session[T.t3] all have zero coefficients, i.e. just a one at the position of the coefficient and zeros everywhere else in the row. (This would be the same hypothesis as dropping session from the regression.)

IIRC (it's late here): lmf_test("session[T.post], session[T.t1], session[T.t2], session[T.t3]") should be equivalent to the 1s at each position. =0 is the default if an equality is not specified

I think a sequential comparison should be equivalent in this case, i.e. [[0,1,0,0,0,0],[0,1,-1,0,0,0],[0,0,1,-1,0,0],[0,0,0,1,-1,0]]

TheChymera commented 7 years ago

In this case, the hypothesis that all levels have the same effect as the reference level would be that session[T.post] to session[T.t3] all have zero coefficients, i.e. just a one at the position of the coefficient and zeros everywhere else in the row.

So you mean this should be equivalent to what ANOVA does?:

anova = lm.f_test([[1,0,0,0,0,0]])
josef-pkt commented 7 years ago

No, for all the session params, i.e. the middle 4 coefficients are zero anova = lm.f_test([[0,1,0,0,0,0], [0,0,1,0,0,0], [0,0,0,1,0,0], [0,0,0,0,1,0]]) and you can check with the string version which should produce the same result.

TheChymera commented 7 years ago

Thank you for the info. I think it would help greatly to include this info into the documentation and/or examples here.

Is that website generated from this repo? or is it managed elsewhere?

TheChymera commented 7 years ago

@josef-pkt I would be happy to contribute a relevant example if you can tell me where the source of the documentation is

yl565 commented 7 years ago

@josef-pkt why I'm receiving this error?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-2b50b83a8566> in <module>()
     17     md = smf.mixedlm("value ~ nback", data, groups=data["subject"])
     18     mdf = md.fit()
---> 19     anova = md.f_test([[0,1,0,0],[0,0,1,0]])
     20     print(anova)
     21 

AttributeError: 'MixedLM' object has no attribute 'f_test'

statsmodels version: '0.8.0.dev0+5797248'

kshedden commented 7 years ago

There is no f_test implemented for MixedLM. A PR would be welcome, but note there is a lot of debate about how to do this properly, especially in terms of setting the degrees of freedom.

Kerby

On Thu, Dec 1, 2016 at 7:12 PM, Yichuan Liu notifications@github.com wrote:

@josef-pkt https://github.com/josef-pkt why I'm receiving this error?

---------------------------------------------------------------------------AttributeError Traceback (most recent call last) in () 17 md = smf.mixedlm("value ~ nback", data, groups=data["subject"]) 18 mdf = md.fit()---> 19 anova = md.f_test([[0,1,0,0],[0,0,1,0]]) 20 print(anova) 21 AttributeError: 'MixedLM' object has no attribute 'f_test'

statsmodels version: '0.8.0.dev0+5797248'

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/statsmodels/statsmodels/issues/3262#issuecomment-264336026, or mute the thread https://github.com/notifications/unsubscribe-auth/ACiww8gmmhAVzFKe6-VGxxdIpJdzDarmks5rD2KGgaJpZM4Kq057 .

josef-pkt commented 7 years ago

@yl565 You used the model instance instead of the results instance f_test and wald_test are inherited in the Results instance AFAICS from the code, but they most likely won't work, but should raise a shape mismatch or some other exception. wald_test has a use_t option to switch between t and normal distribution for the test and the corresponding default otherwise it's the same as f_test. However, wald_test requires df_resid, AFAIR, even though it is not used if use_t=False. MixedLMResults defines df_resid, I've just seen.

So wald_test (instead of f_test) could be made to work for MixedLM by using the same overwriting of the inherited method as in t_test.

All models except for the linear regression models like OLS and WLS default to using the asymptotic normal distribution. (default use_t=False is inherited unless overwritten)

josef-pkt commented 6 years ago

I'm adding missing labels, but I'm not sure what the status is.

josef-pkt commented 6 years ago

just an update and a gist (while triaging unlabeled issues)

both f_test and wald_test work with the full params length, e.g. rslt2.f_test(np.eye(len(rslt2.params)))

However, t_test uses only the fixed effects params (and result in t_test look to be the same as in summary (which is summary2 version), and raises/breaks with full length constrained np.eye(len(rslt2.params))

rslt2.k_fe < len(rslt2.params)

rslt2.t_test(np.eye(rslt2.k_fe))

gist, using a copy-paste of a test case and trying out the test methods https://gist.github.com/josef-pkt/74d76966c7770598bfdb491569808939

jpzhangvincent commented 3 years ago

Is there a roadmap to add likelihood ratio test for mixedlm models?