Wide confidence interval for ATE

shibakoichiro commented 3 years ago

I get very wide confidence intervals for the ATE estimate, and they include the null value of zero. However, when I checked the distribution of the CATE predictions, they were all greater than zero. I know the ATE is estimated using a doubly robust estimator rather than the grf algorithm itself, but I am struggling to make sense of the results. Is it natural to have a result like this (i.e., the CATEs being positive while the confidence interval for the ATE includes the null value)? I used the following code to estimate the CATE and ATE.

forest <- causal_forest(X, Y, W, num.trees = 10000, tune.parameters = "all")
preds.cate <- predict(forest, estimate.variance = FALSE)
summary(preds.cate$predictions)

ATE <- average_treatment_effect(forest, target.sample = "all", method = "TMLE")

erikcs commented 3 years ago

Hi @shibakoichiro, generally speaking, even if all point estimates of something are above zero, their mean does not necessarily have to be significantly different from zero.

You can tweak the signal/noise ratio in a DGP and get both behaviors:

set.seed(123456789)
n <- 500
p <- 20
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * 0.15 * W + X[, 2] + pmin(X[, 3], 0) + 0.5 * rnorm(n)
cf <- causal_forest(X, Y, W)

average_treatment_effect(cf)
# estimate    std.err 
# 0.07398224 0.04933766 

summary(predict(cf)$predictions)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.01403 0.05652 0.07168 0.07374 0.08907 0.16675
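
To make the point concrete, here is a rough worked calculation using the numbers printed above. It is only a sketch: it assumes a normal-approximation 95% interval (estimate plus or minus about 1.96 standard errors), and the variable names are just for illustration.

```r
# Numbers copied from the output above
est <- 0.07398224
se <- 0.04933766
min.cate <- 0.01403   # smallest predicted CATE

# Normal-approximation 95% confidence interval for the ATE (an assumption)
ci <- est + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)

# All point predictions are positive, yet the interval covers zero
c(all.predictions.positive = min.cate > 0,
  ci.covers.zero = ci[1] < 0 && ci[2] > 0)
```

The interval comes out at roughly (-0.023, 0.171), so it contains zero even though the smallest CATE prediction is positive.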

erikcs commented 3 years ago

Also, the best linear projection might be useful:

Here is a similar illustration in a low-signal case where one covariate matters for tau:

set.seed(123)
n <- 1000
p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + 2 * rnorm(n)
cf <- causal_forest(X, Y, W)

average_treatment_effect(cf)
# estimate   std.err 
# 0.1891587 0.1324706

As you might expect, you are barely able to detect an average effect (the estimate is only about 1.4 standard errors from zero). However, if you plot the CATEs against X1, there seems to be a positive effect:

X.test <- matrix(0, 25, p)
X.test[, 1] <- seq(-2, 2, length.out = 25)
# (if you plot confidence bars they will be very wide due to the low signal)
pp <- predict(cf, X.test, estimate.variance = TRUE) 
plot(X.test[,1], pp$predictions, type = "l")

[Screenshot: line plot of pp$predictions against X.test[, 1]]

And in this case the best linear projection on X1 (details: https://arxiv.org/abs/1702.06240) provides some support for this:

best_linear_projection(cf, X[,1])
# Best linear projection of the conditional average treatment effect.
# Confidence intervals are cluster- and heteroskedasticity-robust (HC3):
#
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.18239    0.13196  1.3821 0.167235
# A1           0.41962    0.13728  3.0567 0.002298 **
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

adeldaoud commented 3 years ago

Thanks for the examples @erikcs. I have two follow-up questions.

A) So to summarize your point: if the noise is large relative to the W effect, then we will see that the ATE will have a wide confidence interval. In this scenario, merely plotting the CATE will wrongly indicate that heterogeneity is present. What we are seeing is merely noise. Is this a correct interpretation?

In the DGP, the noise is controlled by "+ 2 * rnorm(n)" in "Y <- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + 2 * rnorm(n)".

B) What is the intuition behind the best linear projection function? I see that the coefficient of the intercept corresponds to the ATE, but what does the coefficient of A1 (which refers to X1?) correspond to? And what do the p-values correspond to? I read the manual, but I am still trying to wrap my head around it.

Thanks

erikcs commented 3 years ago

Edit: I marked my first post as outdated, as sticking to the second example is clearer.

@adeldaoud,

Yes, in that example the noise was increased by just multiplying the error term by 2.

And it is just an example; you should not draw any generic statistical insight from it. It was a simple illustration of marginally powered data. Remember that failing to detect an effect does not mean that there isn't one present, and vice versa. We can't give a canned recipe for this, as inference is a very nuanced undertaking.

And back to the example (in https://github.com/grf-labs/grf/issues/757#issuecomment-735280472) with the best linear projection (BLP), here is a summary:

In that example the signal is low compared to the noise, and it turned out we were not able to detect an average effect, as average_treatment_effect(cf) shows, even though we know one is present.

In this contrived example we know that X1 has a positive effect (in a real trial, focusing on this covariate could be pre-registered), and the plot seems to support this, though with caution, because if you include the pointwise CIs they are very wide.

However, a plot in itself does not carry too much evidence, it is complicated by a fundamental issue with non-parametric inference on treatment effects tau(x): it is a considerably hard statistical problem, as tau(x) may be a very complex function, and pointwise confidence intervals may thus be very wide.

This is where the BLP comes in: in https://arxiv.org/abs/1702.06240 @vsemenova shows that if we switch our focus from tau(x) to a "coarser" summary, namely some linear projection of tau(x), then we may actually estimate it with good statistical accuracy (where the exact definition of "good statistical accuracy" is in the paper). This linear projection is defined as:

{b_0, b} = argmin E[(tau(x) - b_0 - Ab)^2],

where A is some chosen set of covariates (typically a subset, or transformation of the X's).

What this means for the above example is the following: we suspect that X1 has a positive effect on tau, and we want to present statistical evidence for that beyond just a plot. We can do that with a BLP: if we project tau on X1 we should expect a positive and significant coefficient, and that is what the BLP returns in this example. Hence, even though we were not able to detect an ATE and the pointwise CIs are quite wide, we could present some evidence suggesting that there might be a benefit by employing the BLP.

(yes, A1 is the same as X1, A* is just the default variable name output. If you think this is helpful then maybe we should consider having a vignette on this)
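
For intuition only, here is a base-R sketch of the idea behind the BLP: form a doubly robust (AIPW) score for each unit and regress it on the chosen covariate. This is not grf's implementation. It assumes a randomized treatment with known propensity 0.5, uses simple in-sample lm() outcome models instead of forest-based out-of-bag estimates, and reports plain OLS standard errors rather than HC3. All variable names are made up for the illustration.

```r
# Simulate data similar to the low-signal example (X1 drives tau)
set.seed(1)
n <- 2000
X1 <- rnorm(n)
X2 <- rnorm(n)
W <- rbinom(n, 1, 0.5)
tau <- pmax(X1, 0)                 # true CATE
Y <- tau * W + X2 + 2 * rnorm(n)

e <- 0.5                           # known propensity (assumption: randomized trial)

# Crude outcome models for treated and control units (assumption: linear fits)
fit1 <- lm(Y ~ X1 + X2, subset = W == 1)
fit0 <- lm(Y ~ X1 + X2, subset = W == 0)
d <- data.frame(X1 = X1, X2 = X2)
mu1 <- predict(fit1, d)
mu0 <- predict(fit0, d)

# Doubly robust (AIPW) score for each unit
gamma <- mu1 - mu0 + W * (Y - mu1) / e - (1 - W) * (Y - mu0) / (1 - e)

# Linear projection of the scores on X1: intercept and slope play the
# roles of b_0 and b in the definition above
blp <- lm(gamma ~ X1)
summary(blp)$coefficients
```

The individual scores are very noisy, but their linear projection on a single covariate is a low-dimensional quantity, which is why it can be estimated much more precisely than tau(x) at any given point.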

adeldaoud commented 3 years ago

Thanks, @erikcs

First, let me see if I got your BLP explanation (intuition) right by summarizing it. We fit a GRF to evaluate the causal effect of W on Y, conditional on a set of X. We may look for whether the ATE is substantively and statistically significant as well as whether some subspace of CATE (conditional on some X) is significant. After fitting a GRF, we can calculate the ATE straightforwardly, using “average_treatment_effect”. However, evaluating CATE is not as straightforward because we cannot be sure if the variation that we observe (e.g., in a histogram of CATE over a subspace of X) is just random noise or a true effect heterogeneity.

BLP provides a way to assess whether the CATE variation is true (systematic) or noise (random) for a set of Xs. As BLP relies on parametric statistics, we can use those statistical properties to present stronger evidence on whether the CATE varies systematically with some X_i (i.e., whether the coefficient of X_i is statistically significant) than we could by merely plotting the CATE.

Is my interpretation of your explanation correct?

Second, if my interpretation is correct, is the following statement also correct? As the BLP is a linear model, we would have to manually specify an interaction term if we believe that X_1 and X_2 interact in producing a combined CATE. We would then need to probe the marginal effects to fully disentangle their joint effect on tau.

Third, re "If you think this is helpful then maybe we should consider having a vignette on this", yes that would be quite helpful for applied researchers to see a vignette on this a bit more fleshed out intuition-wise.

vsemenova commented 3 years ago

Hello!

>>I get very large confidence intervals for ATE estimate that include the null value of zero. However, when I check the distributions of CATE, they were all greater than zero.

It can happen that the ATE confidence interval contains zero while all of the predicted CATE values are positive. A confidence interval is an object that covers the true value of ATE with a pre-specified probability (usually, 0.95). You must guarantee coverage of ATE but you are allowed to report the lower and the upper bound (i.e., be ambiguous) instead of 1 number.

Predicted CATE values are the "best point guesses" of true CATE values. You are trying to make the best guess on average but are not required to guess any particular point correctly.

Having a wide ATE confidence interval (covering zero) tells me you have a lot of uncertainty around your ATE point estimate. One possible reason is that the propensity score p(X) (i.e., the conditional probability of treatment assignment) is concentrated near 0 or 1 (e.g., almost all males are assigned to treatment and almost all females to control). Since the denominator of the ATE's variance is proportional to p(X)*(1-p(X)), which is close to zero in this case, the variance of your ATE estimator blows up.
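
As a rough numerical illustration of that last point (a sketch that assumes a constant propensity e and ignores everything else in the variance), the factor 1 / (e * (1 - e)) grows quickly as e moves away from 0.5:

```r
# Candidate propensity values, from balanced to very poor overlap
e <- c(0.5, 0.2, 0.05, 0.01)

# Inverse of the p(X) * (1 - p(X)) term that appears in the variance
inflation <- 1 / (e * (1 - e))

# Standard error relative to the balanced case e = 0.5
relative.se <- sqrt(inflation / inflation[1])

data.frame(propensity = e, inflation = inflation, relative.se = relative.se)
```

With a fitted causal forest, the estimated propensities are, to my knowledge, stored in the W.hat field of the forest object, so a histogram of forest$W.hat is one common way to check whether overlap is the problem.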

grf-labs / grf

Wide confidence interval for ATE #757