First, you need to take into account that there is heteroscedasticity: Generic and Generic Masked have different variances.
The easiest approach is to use cov_type="HC3" or cov_type="HC2", which computes standard errors that are robust to unequal variances (the analogue of using a Welch t-test or ANOVA).
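For example, something along these lines (a minimal sketch; the data file and column names are taken from the code further down in this thread):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Data and column name as used in the bootstrap code later in this thread
df = pd.read_csv('volume.csv')

# Same two-way model, but with heteroscedasticity-robust (HC3) standard errors
ols_robust = smf.ols('Q("Volume Conservation Factor") ~ Processing*Contrast', df).fit(cov_type="HC3")
print(ols_robust.summary())
```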
`anova_lm` does not handle robust cov_types, so you would need to use `t_test` and `wald_test` directly.
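A sketch of how that could look, continuing from the robust fit above (the term names assume the default treatment coding that appears later in this thread):

```python
# Joint (Wald) test of all terms involving Processing, using the robust covariance
print(ols_robust.wald_test(
    "Processing[T.Generic Masked] = 0, Processing[T.Generic Masked]:Contrast[T.CBV] = 0"))

# Single-restriction t_test of one linear combination of coefficients against 1
print(ols_robust.t_test("Intercept + Contrast[T.CBV] = 1"))
```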
We have HetGLS to estimate the weights for WLS, but it hasn't been used much and I haven't looked at it in a long time. An alternative would be to estimate the variance for the different cases (exogs) and use WLS directly.
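A rough sketch of the second option (per-cell variances plugged into WLS weights; the column names are again taken from the code further down):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('volume.csv')
y = 'Volume Conservation Factor'

# One variance estimate per Processing x Contrast cell, inverse variance as WLS weight
cell_var = df.groupby(['Processing', 'Contrast'])[y].transform('var')
wls = smf.wls('Q("{}") ~ Processing*Contrast'.format(y), df, weights=1.0 / cell_var).fit()
print(wls.summary())
```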
> show that one distribution is distributed "closer to 1" than the other

Maybe stats.stackexchange can provide more ideas for this.
Because of the difference in variance (and distribution shape) between the two cases, "closer" can be either in terms of means/parameters or in terms of properties of the distribution (e.g. there is more probability in a neighborhood of 1 for one treatment than for the other).
We don't have much ready-made for the second hypothesis. Most likely, inference would have to be based on simulation or a (heteroscedastic) bootstrap.
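One possibility, as a rough sketch (the tolerance `delta` and the per-group resampling scheme are just placeholders, not something ready-made in statsmodels):

```python
import numpy as np
import pandas as pd

df = pd.read_csv('volume.csv')
y = 'Volume Conservation Factor'
delta = 0.1  # placeholder tolerance for "close to 1"
rng = np.random.default_rng(0)

def prop_close(values, delta):
    # Proportion of observations within delta of the optimal value 1
    return np.mean(np.abs(values - 1) < delta)

a = df.loc[df['Processing'] == 'Generic', y].to_numpy()
b = df.loc[df['Processing'] == 'Generic Masked', y].to_numpy()

# Stratified bootstrap of the difference in "close to 1" proportions
boot = np.array([
    prop_close(rng.choice(a, a.size), delta) - prop_close(rng.choice(b, b.size), delta)
    for _ in range(10000)
])
print("observed difference:", prop_close(a, delta) - prop_close(b, delta))
print("95% bootstrap CI:", np.percentile(boot, [2.5, 97.5]))
```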
The hypothesis on means/parameters can be tested using `results.t_test` for a single hypothesis and `results.wald_test` for a joint hypothesis, but they don't directly support inequality hypotheses.
e.g. H0: "predicted mean at exog" = 1

```python
results.t_test("Intercept = 1")
results.t_test("Intercept + Processing + Processing:Contrast = 1")
results.t_test("Intercept + Contrast + Processing:Contrast = 1")
results.t_test("Intercept + Processing + Contrast + Processing:Contrast = 1")
```
It's also possible to specify a different parameterization in the formula to get the parameters for the 4 means directly. That would also provide confidence intervals for the 4 means directly.
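For example, something like this (a sketch reusing `df` from the snippet above; dropping the intercept with `- 1` makes patsy produce one coefficient per Processing x Contrast cell):

```python
# Cell-means parameterization: one coefficient and confidence interval per cell,
# each directly comparable to the optimal value 1
cell_means = smf.ols(
    'Q("Volume Conservation Factor") ~ C(Processing):C(Contrast) - 1', df
).fit(cov_type="HC3")
print(cell_means.params)
print(cell_means.conf_int())
```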
Trying `t_test` gives me non-significant results for the Processing variable but significant results for the Contrast variable, probably because the means differ more between the contrast levels than between the processing (workflow) levels. The following code
```python
import pandas as pd
import statsmodels.formula.api as smf

# Data and column name as in the bootstrap code below
df = pd.read_csv('volume.csv')
dependent_variable = 'Volume Conservation Factor'

expression = 'Processing*Contrast'
formula = 'Q("{}") ~ {}'.format(dependent_variable, expression)
ols = smf.ols(formula, df).fit()
print(ols.summary())

# t-tests of linear combinations of the coefficients against 1
test = ols.t_test("Intercept = 1")
print('Intercept: t value =', float(test.tvalue), 'p =', test.pvalue)
test = ols.t_test("Intercept + Processing[T.Generic Masked] + Processing[T.Generic Masked]:Contrast[T.CBV] = 1")
print('Processing: t value =', float(test.tvalue), 'p =', test.pvalue)
test = ols.t_test("Intercept + Contrast[T.CBV] + Processing[T.Generic Masked]:Contrast[T.CBV] = 1")
print('Contrast: t value =', float(test.tvalue), 'p =', test.pvalue)
test = ols.t_test("Intercept + Processing[T.Generic Masked] + Contrast[T.CBV] + Processing[T.Generic Masked]:Contrast[T.CBV] = 1")
print('Processing + Contrast: t value =', float(test.tvalue), 'p =', test.pvalue)
```
gives:

```
Intercept: t value = -2.111065544175952 p = 0.03665236434724408
Processing: t value = -0.6901153872938636 p = 0.49133320747146403
Contrast: t value = -2.5839034892219144 p = 0.010856414920544209
Processing + Contrast: t value = -4.659713002248325 p = 7.637378547029597e-06
```
So I guess it doesn't make sense in my case to compare the means. Would it make sense to compute a bootstrapped distribution of RMSEs instead? I have tried to compute the bootstrapped distributions as follows. Does that make sense?
```python
import pandas as pd
from numpy.random import choice
from sklearn.metrics import mean_squared_error
from math import sqrt
import statsmodels.formula.api as smf

volume_df = pd.read_csv('volume.csv')

# Split the data into the four Processing x Contrast cells
generic_df = volume_df.loc[volume_df['Processing'] == 'Generic']
generic_CBV_df = generic_df.loc[generic_df['Contrast'] == 'CBV']
generic_BOLD_df = generic_df.loc[generic_df['Contrast'] == 'BOLD']
generic_masked_df = volume_df.loc[volume_df['Processing'] == 'Generic Masked']
generic_masked_CBV_df = generic_masked_df.loc[generic_masked_df['Contrast'] == 'CBV']
generic_masked_BOLD_df = generic_masked_df.loc[generic_masked_df['Contrast'] == 'BOLD']

cells = [
    ('CBV', 'Generic', generic_CBV_df),
    ('BOLD', 'Generic', generic_BOLD_df),
    ('CBV', 'Generic Masked', generic_masked_CBV_df),
    ('BOLD', 'Generic Masked', generic_masked_BOLD_df),
]

# Resample each cell with replacement and compute the RMSE of the VCF around 1
rows = []
N = 10000
for i in range(N):
    for contrast, processing, cell_df in cells:
        values = cell_df['Volume Conservation Factor'].to_list()
        resample = choice(values, len(values))
        rmse = sqrt(mean_squared_error([1] * len(resample), resample))
        rows.append([contrast, processing, str(i), rmse])

bootstrapped_RMSEs = pd.DataFrame(rows, columns=['Contrast', 'Processing', 'Uid', 'RMSE'])

dependent_variable = 'RMSE'
expression = 'Processing*Contrast'
formula = 'Q("{}") ~ {}'.format(dependent_variable, expression)
model = smf.mixedlm(formula, bootstrapped_RMSEs, groups='Processing')
fit = model.fit()
print(fit.summary())
```
Which gives as a result:
```
                     Mixed Linear Model Regression Results
========================================================================================
Model:                  MixedLM       Dependent Variable:       Q("RMSE")
No. Observations:       40000         Method:                   REML
No. Groups:             2             Scale:                    0.0002
Min. group size:        20000         Likelihood:               114543.5184
Max. group size:        20000         Converged:                Yes
Mean group size:        20000.0
----------------------------------------------------------------------------------------
                                              Coef.  Std.Err.    z      P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept                                      0.060    0.014    4.346  0.000  0.033  0.087
Processing[T.Generic Masked]                  -0.018    0.020   -0.918  0.359 -0.056  0.020
Contrast[T.CBV]                                0.035    0.000  177.417  0.000  0.034  0.035
Processing[T.Generic Masked]:Contrast[T.CBV]  -0.021    0.000  -77.155  0.000 -0.022 -0.021
Processing Var                                 0.000
========================================================================================
```
@josef-pkt could you tell me if I employed your advice and the relevant statsmodels functions correctly? I am not sure which function to use to analyze the bootstrapped distributions of the RMSEs.
I have two distributions of volume conservation factors (VCF), Generic and Generic Masked, that I want to compare. A visualization of the two distributions can be found here: https://imgur.com/a/tgUp7in
Since the VCF is optimal when equal to 1, I want to show that one distribution is distributed "closer to 1" than the other, ideally also showing that the contrast variable has no effect on this. Which test is best suited for this purpose?
I have tried an ANOVA test, but got non-significant results, probably because the means are almost equal.
Below is the code for the ANOVA test, which can be found together with the data here:
Output: