ATE over subsets with low and high estimated CATEs - nonsensical results

RamirezAmayaS commented 1 year ago

Description of the bug

I am revisiting the analysis of Athey and Wager (2019). I am interested in running a falsification analysis where the causal forests are trained not on the student and school covariates but on randomly generated vectors. My prior is that the heterogeneity tests should fail to reject the null of no heterogeneity. However, when comparing subsets with high and low estimated CATEs, the estimated average treatment effect on the high subset is close to zero, while the estimated average treatment effect on the low subset is orders of magnitude larger. I can't find an explanation for this behavior. Is it a bug?

The other tests seem fine. The global ATE is close enough to the original results. The calibration test fails to reject the null of no heterogeneity.

Steps to reproduce

library(grf)

df = read.csv("experiments/acic18/synthetic_data.csv")

X = matrix(runif(n=nrow(df)*10),nrow=nrow(df))
colnames(X) = c("RF1","RF2","RF3","RF4","RF5","RF6","RF7","RF8","RF9","RF0")

Z = df$Z
Y = df$Y

Y.forest = regression_forest(
    X , 
    Y
)

Y.hat = predict(Y.forest)$predictions
Z.forest = regression_forest(
    X , 
    Z
)

Z.hat = predict(Z.forest)$predictions

cf.raw = causal_forest(
    X, 
    Y, 
    Z,
    Y.hat = Y.hat, 
    W.hat = Z.hat
)

varimp = variable_importance(cf.raw)
selected.idx = which(varimp > mean(varimp))

cf = causal_forest(
    X[,selected.idx], 
    Y, 
    Z,
    Y.hat = Y.hat, 
    W.hat = Z.hat,
    tune.parameters = "all"
)

tau.df = predict(cf,estimate.variance=TRUE)[,c(1,2)]
tau.hat = tau.df$predictions

# Distribution of predicted effects
hist(tau.hat)

# Average treatment effect
ATE = average_treatment_effect(cf)
paste(
    "95% CI for the ATE:", 
    round(ATE[1],3), 
    "+/-", 
    round(qnorm(0.975)*ATE[2],3)
)

Outputs: '95% CI for the ATE: 0.303 +/- 0.026'

# Compare regions with high and low estimated CATE
high_effect = tau.hat > median(tau.hat)
ate.high = average_treatment_effect(cf, subset=high_effect)
ate.low = average_treatment_effect(cf, subset=!high_effect)
paste(
    "95% CI for the difference in ATE:",
    round(ate.high[1] - ate.low[1],3),
    "+/-",
    round(qnorm(0.975)*sqrt(ate.high[2]^2 + ate.low[2]^2),3)
)

Outputs: '95% CI for the difference in ATE: -0.56 +/- 0.051'

average_treatment_effect(cf, subset=high_effect)

Outputs: estimate:-0.00124768810374905 std.err: 0.0182608951524164

average_treatment_effect(cf, subset=!high_effect)

Outputs: estimate: 0.608046759001875 std.err: 0.0182508648601049

# Test calibration
test_calibration(cf)

Outputs:

Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                                  Estimate  Std. Error t value Pr(>t)    
mean.forest.prediction            1.001729    0.041462  24.160 <2e-16 ***
differential.forest.prediction -682.911383   24.255158 -28.155      1    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

GRF version grf_2.2.1

erikcs commented 1 year ago

Hi @RamirezAmayaS, what you are observing is unfortunately a known artifact of doing these kinds of evaluations using Out-of-Bag (OOB) estimates. The suggested modern approach is to use the RATE with a training and evaluation sample. If you repeat your example from above, then you should see a flat TOC curve / zero RATE (when using a train/test split).
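
A minimal sketch of that train/evaluation workflow, run on simulated data rather than the ACIC file from the report. The simulated design (noise-only covariates, randomized treatment, a constant effect of 0.3) and the variable names are assumptions for illustration; `rank_average_treatment_effect` is the grf function that estimates the RATE.

```r
library(grf)

set.seed(42)
n = 2000
p = 10

# Assumed setup: pure-noise covariates, randomized treatment, constant effect.
X = matrix(runif(n * p), nrow = n)
W = rbinom(n, size = 1, prob = 0.5)
Y = 0.3 * W + rnorm(n)

# Split once: one half is used to rank units, the other half to evaluate the ranking.
train = sample(n, n / 2)

train.forest = causal_forest(X[train, ], Y[train], W[train])
eval.forest = causal_forest(X[-train, ], Y[-train], W[-train])

# Priorities for the evaluation units come from a forest that never saw them.
priority = predict(train.forest, X[-train, ])$predictions

rate = rank_average_treatment_effect(eval.forest, priority)
plot(rate)  # TOC curve
paste(
    "95% CI for the AUTOC:",
    round(rate$estimate, 3),
    "+/-",
    round(qnorm(0.975) * rate$std.err, 3)
)
```

Because the ranking and the evaluation use disjoint halves of the data, an outcome never feeds into its own priority score. With noise-only covariates one would expect the interval to cover zero, though a single split is itself noisy.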

RamirezAmayaS commented 1 year ago

Hi @erikcs, thanks for the suggestion. I'll try the RATE approach. Do you happen to know of a reference explaining why the OOB evaluation fails?

erikcs commented 1 year ago

I'm not sure of a reference, but here is a simple example illustrating the issue with an OOB mean:

Let $Y_i \sim \text{Bernoulli}(\mu)$, $i = 1, \dots, n$, with mean $\mu = 0.5$.

Then $\mu^{(-1)} = \mu - (Y_i - \bar Y) / (n - 1)$ and

$E[Y_i | \mu^{(-1)} > 0.5] = 0$

$E[Y_i | \mu^{(-1)} < 0.5] = 1$.

RamirezAmayaS commented 1 year ago

Thanks for your reply.

I don't think I'm following. Shouldn't the OOB mean be $\mu_{j}^{(-1)} = \frac{1}{n-1} \sum_{i \neq j} Y_i$?
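
One way to reconcile the two expressions, assuming the $\mu$ on the right-hand side of the earlier formula stands for the sample mean $\bar Y$: the leave-one-out average can be rewritten as $\frac{1}{n-1} \sum_{i \neq j} Y_i = \frac{n \bar Y - Y_j}{n-1} = \bar Y - \frac{Y_j - \bar Y}{n-1}$, so each held-out mean moves in the opposite direction from the outcome it excludes. The base-R sketch below checks that identity numerically and then splits on the held-out mean. Thresholding at the sample mean rather than at 0.5 is a choice made here so that the split is exact; the seed and sample size are arbitrary.

```r
set.seed(1)
n = 1000
Y = rbinom(n, size = 1, prob = 0.5)

# Leave-one-out mean as written in the question: average over all i != j.
loo.direct = sapply(seq_len(n), function(j) mean(Y[-j]))

# The same quantity written as a correction to the full-sample mean.
loo.shortcut = mean(Y) - (Y - mean(Y)) / (n - 1)
stopifnot(isTRUE(all.equal(loo.direct, loo.shortcut)))

# Group units by whether their held-out mean is above the full-sample mean.
high = loo.direct > mean(Y)

# Units with a "high" held-out mean are exactly the ones with Y = 0, and vice versa.
mean(Y[high])   # 0
mean(Y[!high])  # 1
```

The same mechanism would plausibly apply to out-of-bag CATE predictions: a unit's own outcome is excluded from its prediction, so sorting units by that prediction sorts them, in part, against their own noise.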

grf-labs / grf

ATE over subsets with low and high estimated CATEs - nonsensical results #1287