statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

Choosing the right hypothesis test #6583

Open Jimmy2027 opened 4 years ago

Jimmy2027 commented 4 years ago

I have two distributions of volume conservation factors (VCF), Generic and Generic Masked, that I want to compare. A visualization of the two distributions can be found here: https://imgur.com/a/tgUp7in

Since the VCF is optimal when equal to 1, I want to show that one distribution is distributed "closer to 1" than the other, ideally also showing that the contrast variable has no effect on this. Which test is best suited for this purpose?

I have tried an ANOVA test, but got non-significant results, probably because the means are almost equal.

Below is the code for the ANOVA test, which can be found together with the data here:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv('data.csv')
print('Generic: ')
print(df[['Volume Conservation Factor', 'Processing']].loc[df['Processing'] == 'Generic'].describe(), '\n')
print('Generic Masked: ')
print(df[['Volume Conservation Factor', 'Processing']].loc[df['Processing'] == 'Generic Masked'].describe(), '\n')
dependent_variable = 'Volume Conservation Factor'
expression = 'Processing*Contrast'

formula = 'Q("{}") ~ {}'.format(dependent_variable, expression)

ols = smf.ols(formula, df).fit()
print(ols.summary())
print(ols.params)
anova = sm.stats.anova_lm(ols, typ=3)
print(anova)

Output:

Generic: 
       Volume Conservation Factor
count                   68.000000
mean                     0.963066
std                      0.072723
min                      0.575684
25%                      0.931986
50%                      0.974814
75%                      1.004041
max                      1.083536 

Generic Masked: 
       Volume Conservation Factor
count                   68.000000
mean                     0.966581
std                      0.036193
min                      0.901859
25%                      0.936341
50%                      0.967009
75%                      0.991662
max                      1.053411 

                                   OLS Regression Results                                  
===========================================================================================
Dep. Variable:     Q("Volume Conservation Factor")   R-squared:                       0.063
Model:                                         OLS   Adj. R-squared:                  0.042
Method:                              Least Squares   F-statistic:                     2.977
Date:                             Sun, 15 Mar 2020   Prob (F-statistic):             0.0340
Time:                                     11:16:01   Log-Likelihood:                 200.97
No. Observations:                              136   AIC:                            -393.9
Df Residuals:                                  132   BIC:                            -382.3
Df Model:                                        3                                         
Covariance Type:                         nonrobust                                         
================================================================================================================
                                                   coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------
Intercept                                        0.9797      0.010    101.947      0.000       0.961       0.999
Processing[T.Generic Masked]                    -0.0018      0.014     -0.130      0.897      -0.029       0.025
Contrast[T.CBV]                                 -0.0333      0.014     -2.450      0.016      -0.060      -0.006
Processing[T.Generic Masked]:Contrast[T.CBV]     0.0106      0.019      0.550      0.583      -0.027       0.049
==============================================================================
Omnibus:                       94.412   Durbin-Watson:                   1.207
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1112.649
Skew:                          -2.181   Prob(JB):                    2.46e-242
Kurtosis:                      16.317   Cond. No.                         6.85
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Intercept                                       0.979713
Processing[T.Generic Masked]                   -0.001771
Contrast[T.CBV]                                -0.033293
Processing[T.Generic Masked]:Contrast[T.CBV]    0.010571
dtype: float64
                        sum_sq     df             F         PR(>F)
Intercept            32.634447    1.0  10393.112525  2.159952e-127
Processing            0.000053    1.0      0.016976   8.965337e-01
Contrast              0.018843    1.0      6.001027   1.560797e-02
Processing:Contrast   0.000950    1.0      0.302506   5.832449e-01
Residual              0.414481  132.0           NaN            NaN
josef-pkt commented 4 years ago

First, you need to take into account that there is heteroscedasticity: Generic and Generic Masked have different variances. The easiest option is to use cov_type="hc3" or cov_type="hc2", which computes standard errors that are robust to unequal variances (analogous to using a Welch t-test or ANOVA). anova_lm does not handle robust cov_types, so you would need to use t_test and wald_test directly. We have HetGLS to estimate the weights for WLS, but it hasn't been used much and I haven't looked at it in a long time. An alternative would be to estimate the variance for the different cases (exogs) and use WLS directly.
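A minimal sketch of the robust-covariance suggestion, using synthetic stand-in data since the issue's data.csv is not available here (column name "vcf" and the distribution parameters are assumptions, chosen to roughly match the descriptive statistics above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for data.csv: two Processing levels with unequal
# variances, crossed with two Contrast levels.
rng = np.random.default_rng(0)
n = 68
df = pd.DataFrame({
    "vcf": np.concatenate([rng.normal(0.96, 0.07, n), rng.normal(0.97, 0.04, n)]),
    "Processing": ["Generic"] * n + ["Generic Masked"] * n,
    "Contrast": (["CBV", "BOLD"] * (n // 2)) * 2,
})

# HC3 covariance gives heteroscedasticity-robust standard errors,
# analogous to Welch-type inference.
res = smf.ols("vcf ~ Processing * Contrast", df).fit(cov_type="HC3")
print(res.t_test("Intercept = 1"))  # H0: predicted mean at the base levels equals 1
```

The same robust covariance is then used automatically by `t_test` and `wald_test` on the fitted results.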

show that one distribution is distributed "closer to 1" than the other

Maybe stats.stackexchange can provide more ideas for this.

Because of the difference in the variance (and distribution shape) between the two cases, "closer" can either be in terms of means/parameters or in terms of properties of the distribution (e.g. there is more probability in the neighborhood of 1 for one treatment than for the other).

We don't have much ready made for the second hypothesis. Most likely inference would have to be based on simulation or (heteroscedastic) bootstrap.
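One way the heteroscedastic bootstrap for the second hypothesis could look, as a sketch with synthetic stand-in samples (the group sizes and distributions are assumptions; RMSE around 1 is used as the "closeness" measure):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.96, 0.07, 68)  # stand-in for Generic VCFs
y = rng.normal(0.97, 0.04, 68)  # stand-in for Generic Masked VCFs

def rmse_to_one(a):
    # Root-mean-square distance from the optimal value 1
    return np.sqrt(np.mean((a - 1.0) ** 2))

# Resample each group separately (so unequal variances are preserved)
# and collect the bootstrap distribution of the RMSE difference.
diffs = np.array([
    rmse_to_one(rng.choice(x, x.size)) - rmse_to_one(rng.choice(y, y.size))
    for _ in range(2000)
])
ci = np.percentile(diffs, [2.5, 97.5])  # percentile CI for the difference
print(ci)
```

If the interval excludes 0, that supports one group being distributed closer to 1 than the other under this measure.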

The hypothesis on means/parameters can be tested using results.t_test for single hypothesis and results.wald_test for joint hypothesis. But they don't directly support inequality hypothesis.

e.g. H0: 'predicted mean at exog' = 1

results.t_test("Intercept = 1")
results.t_test("Intercept + Processing + Processing:Contrast = 1")
results.t_test("Intercept + Contrast + Processing:Contrast = 1")
results.t_test("Intercept + Processing + Contrast + Processing:Contrast = 1")

It's also possible to specify a different parameterization in the formula to get directly parameters for the 4 means. That could also provide confidence intervals directly for the 4 means.
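The cell-means parameterization could be sketched like this (synthetic stand-in data again; the short column name "vcf" is an assumption). Dropping the intercept and keeping only the interaction gives one parameter per Processing x Contrast cell, i.e. the four group means and their confidence intervals directly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 34
df = pd.DataFrame({
    "vcf": rng.normal(0.96, 0.05, 4 * n),
    "Processing": ["Generic"] * (2 * n) + ["Generic Masked"] * (2 * n),
    "Contrast": (["CBV"] * n + ["BOLD"] * n) * 2,
})

# "0 +" removes the intercept; the pure interaction then gets full
# dummy coding, one coefficient per cell mean.
res = smf.ols("vcf ~ 0 + C(Processing):C(Contrast)", df).fit()
print(res.params)      # the four cell means
print(res.conf_int())  # their confidence intervals
```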

Jimmy2027 commented 4 years ago

Trying the t_test gives me non-significant results for the Processing variable but significant results for the Contrast variable. Probably because the means are more different between contrast variables than between workflow variables. The following code

expression = 'Processing*Contrast'
formula = 'Q("{}") ~ {}'.format(dependent_variable, expression)
ols = smf.ols(formula, df).fit()
print(ols.summary())
test = ols.t_test("Intercept = 1")
print('Intercept: t value =', float(test.tvalue), 'p =', test.pvalue)
test = ols.t_test("Intercept + Processing[T.Generic Masked] + Processing[T.Generic Masked]:Contrast[T.CBV] = 1")
print('Processing: t value =', float(test.tvalue), 'p =', test.pvalue)
test = ols.t_test("Intercept + Contrast[T.CBV] + Processing[T.Generic Masked]:Contrast[T.CBV] = 1")
print('Contrast: t value =', float(test.tvalue), 'p =', test.pvalue)
test = ols.t_test("Intercept + Processing[T.Generic Masked] + Contrast[T.CBV] + Processing[T.Generic Masked]:Contrast[T.CBV] = 1")
print('Processing + Contrast: t value =', float(test.tvalue), 'p =', test.pvalue)

gives

Intercept: t value = -2.111065544175952 p = 0.03665236434724408
Processing: t value = -0.6901153872938636 p = 0.49133320747146403
Contrast: t value = -2.5839034892219144 p = 0.010856414920544209
Processing + Contrast: t value = -4.659713002248325 p = 7.637378547029597e-06

So I guess that it doesn't make sense in my case to compare the means. Would it make sense to compute a bootstrapped distribution of RMSEs?

Jimmy2027 commented 4 years ago

I have tried to compute the bootstrapped distributions like the following. Does that make sense?

import pandas as pd
from numpy.random import choice
from sklearn.metrics import mean_squared_error
from math import sqrt
import statsmodels.formula.api as smf

volume_df = pd.read_csv('volume.csv')

generic_df = volume_df.loc[volume_df['Processing']=='Generic']
generic_CBV_df = generic_df.loc[generic_df['Contrast']=='CBV']
generic_BOLD_df = generic_df.loc[generic_df['Contrast']=='BOLD']
generic_masked_df = volume_df.loc[volume_df['Processing']=='Generic Masked']
generic_masked_CBV_df = generic_masked_df.loc[generic_masked_df['Contrast']=='CBV']
generic_masked_BOLD_df = generic_masked_df.loc[generic_masked_df['Contrast']=='BOLD']

N = 10000
rows = []

for i in range(N):
    # Resample each group separately, with replacement
    generic_CBV_temp = choice(generic_CBV_df['Volume Conservation Factor'], len(generic_CBV_df))
    generic_BOLD_temp = choice(generic_BOLD_df['Volume Conservation Factor'], len(generic_BOLD_df))
    generic_masked_CBV_temp = choice(generic_masked_CBV_df['Volume Conservation Factor'], len(generic_masked_CBV_df))
    generic_masked_BOLD_temp = choice(generic_masked_BOLD_df['Volume Conservation Factor'], len(generic_masked_BOLD_df))
    # RMSE of each resample around the optimal value 1
    rmse_generic_CBV = sqrt(mean_squared_error([1] * len(generic_CBV_temp), generic_CBV_temp))
    rmse_generic_BOLD = sqrt(mean_squared_error([1] * len(generic_BOLD_temp), generic_BOLD_temp))
    rmse_generic_masked_CBV = sqrt(mean_squared_error([1] * len(generic_masked_CBV_temp), generic_masked_CBV_temp))
    rmse_generic_masked_BOLD = sqrt(mean_squared_error([1] * len(generic_masked_BOLD_temp), generic_masked_BOLD_temp))
    rows.append(['CBV', 'Generic', str(i), rmse_generic_CBV])
    rows.append(['BOLD', 'Generic', str(i), rmse_generic_BOLD])
    rows.append(['CBV', 'Generic Masked', str(i), rmse_generic_masked_CBV])
    rows.append(['BOLD', 'Generic Masked', str(i), rmse_generic_masked_BOLD])

bootstrapped_RMSEs = pd.DataFrame(rows, columns=['Contrast', 'Processing', 'Uid', 'RMSE'])

dependent_variable = 'RMSE'
expression = 'Processing*Contrast'
formula = 'Q("{}") ~ {}'.format(dependent_variable, expression)
model = smf.mixedlm(formula, bootstrapped_RMSEs, groups='Processing')
fit = model.fit()
summary = fit.summary()
print(summary)

Which gives as a result:

                         Mixed Linear Model Regression Results
========================================================================================
Model:                       MixedLM           Dependent Variable:           Q("RMSE")  
No. Observations:            40000             Method:                       REML       
No. Groups:                  2                 Scale:                        0.0002     
Min. group size:             20000             Likelihood:                   114543.5184
Max. group size:             20000             Converged:                    Yes        
Mean group size:             20000.0                                                    
----------------------------------------------------------------------------------------
                                             Coef.  Std.Err.    z    P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept                                     0.060    0.014   4.346 0.000  0.033  0.087
Processing[T.Generic Masked]                 -0.018    0.020  -0.918 0.359 -0.056  0.020
Contrast[T.CBV]                               0.035    0.000 177.417 0.000  0.034  0.035
Processing[T.Generic Masked]:Contrast[T.CBV] -0.021    0.000 -77.155 0.000 -0.022 -0.021
Processing Var                                0.000                                     
========================================================================================
Jimmy2027 commented 4 years ago

@josef-pkt could you tell me if I employed your advice and the relevant statsmodels functions correctly? I am not sure which function to use to analyze the bootstrapped distributions of the RMSEs.