scunning1975 / mixtape

Data and Program files for Causal Inference: The Mixtape

randomization inference p-values #26

Open alexanderthclark opened 3 years ago

alexanderthclark commented 3 years ago

I'm getting a different p-value from the one calculated in ri.py and noted in the comments of ri.do. I only know Python (and I'm excited about the newly added Python code!), so I'll reference the .py file.

I believe the issue is the line p_value = p_value[p_value['permutation'] == 1], which computes a p-value based on neither a weak nor a strict inequality. Before that line, the signed t-statistics are sorted and ranked, and several rows share the same t-stat of 1. So, if we want the p-value calculation to use a weak inequality (the share of rows with a weakly higher ATE), the following minimal edits would do the job.

p_value = p_value[p_value['ate'] == 1] 
p_value['rank'].max() / n

This gives 0.4285.
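
To make the weak versus strict distinction concrete, here is a minimal sketch. The statistics below are made up for illustration and are not the values from ri.dta; the only point is that ties at the observed value make the two definitions differ.

```python
import numpy as np

# Made-up placebo statistics with ties at the observed value (NOT from ri.dta)
stats = np.array([-2.0, -1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0])
observed = 1.0

# Weak inequality: ties count as "at least as extreme" as the observed value
p_weak = (stats >= observed).mean()

# Strict inequality: ties are excluded
p_strict = (stats > observed).mean()

print(p_weak, p_strict)
```

Picking out a single row among the tied ones and reading off its rank can land anywhere between these two values, depending on how the ties happen to be ordered.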

The simplest code I can think of to do the same thing is the following, though it's not the smartest approach: it loops over all 8! = 40,320 permutations of the assignment vector instead of the 70 distinct combinations of 4 treated units out of 8.

from itertools import permutations
import pandas as pd
import numpy as np

url = 'https://github.com/scunning1975/mixtape/raw/master/ri.dta'
df = pd.read_stata(url, index_col = 'name')
observed_t_stat = 1
y_vec = df.y.values 

# create vector of treatment assignments
# use -1 instead of 0 for dot product assist
d = np.concatenate( [np.ones(4), (-1)*np.ones(4)] ) 

# one signed t-stat (difference in means) per ordering of the assignment vector
t_stats = np.array([np.dot(y_vec, d_vec) / 4 for d_vec in permutations(d)])

p_value = (t_stats >= observed_t_stat).mean()
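
A combinations-based version of the same idea might look like the sketch below. The function name ri_p_value, the tol parameter, and the toy outcomes are all hypothetical and only for illustration; this is not the code in ri.py, and it has not been run against ri.dta.

```python
from itertools import combinations
import numpy as np

def ri_p_value(y, treated_idx, tol=1e-12):
    """One-sided exact randomization p-value (hypothetical helper).

    Enumerates every way to choose len(treated_idx) treated units and
    returns the share of assignments whose difference in means is
    weakly greater than the observed one.
    """
    y = np.asarray(y, dtype=float)
    n, k = len(y), len(treated_idx)
    total = y.sum()

    def diff_in_means(idx):
        treated_sum = y[list(idx)].sum()
        return treated_sum / k - (total - treated_sum) / (n - k)

    observed = diff_in_means(treated_idx)
    stats = np.array([diff_in_means(c) for c in combinations(range(n), k)])
    # Weak inequality; tol guards against floating-point noise among ties
    return (stats >= observed - tol).mean()

# Toy outcomes (made up): units 2 and 3 are treated
y_toy = [1, 2, 3, 4]
p = ri_p_value(y_toy, treated_idx=(2, 3))
print(p)
```

With equal-sized groups the statistic is the same as the dot product with a +1/-1 vector divided by the group size, and each combination appears the same number of times among the permutations, so the two approaches should give the same share.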

I'm opening this as an issue because I want to check my own understanding (I'm self-studying). I also think there's a question of whether the code should use absolute values of the t-statistics to match the book.
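
If absolute values are wanted, only the comparison changes. A small sketch, again with made-up statistics rather than the ri.dta values, and without claiming which version the book intends:

```python
import numpy as np

# Made-up placebo statistics (NOT from ri.dta)
stats = np.array([-2.0, -1.0, 0.0, 0.0, 1.0, 2.0])
observed = 1.0

# Signed comparison: only large positive statistics count as extreme
p_one_sided = (stats >= observed).mean()

# Absolute-value comparison: large statistics of either sign count
p_two_sided = (np.abs(stats) >= abs(observed)).mean()

print(p_one_sided, p_two_sided)
```

The absolute-value p-value can never be smaller than the signed one when the observed statistic is positive, since every row counted by the signed comparison is also counted by the absolute-value comparison.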

scunning1975 commented 3 years ago

I never saw this, and I apologize for not responding. I didn't write the Python code, so I need to look into this more.