grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

GRF refuse to report average treatment effect when the dataset is imbalanced #406

Closed ginward closed 5 years ago

ginward commented 5 years ago

Description of the bug Download the dataset at this link. This is a dataset by LaLonde (1986), Dehejia and Wahba (1999) on the NSW training program. I am not sure if there is copyright - if so it belongs to them.

The data set is very imbalanced - there are around 10000 untreated examples and fewer than 300 treated examples.

When I run the code below with the development version, average_treatment_effect(forest, target.sample = "all") returns NaN. The same code returns numbers in the released version 0.10.2; it only returns NaN on the development branch.

Is this supposed to be a feature (i.e., returning NaN when the data are very imbalanced), or is it actually a bug?

The average treatment effect on the treated and the overlap-weighted treatment effect are both normal, although those numbers also differ slightly from the released branch.

Steps to reproduce

# Compile the grf development code first, then load it
library(grf)
library(dplyr)
select <- dplyr::select

TREES_CPS <- 1000
SEED <- 6000
set.seed(SEED)
dat <- read.csv("bug.csv")
forest <- causal_forest(
  X = as.matrix(select(dat, re74, re75, age, education, black, hispanic,
                       married, nodegree)),
  Y = dat$re78,
  W = dat$treat,
  num.trees = TREES_CPS,
  seed = SEED
)
average_treatment_effect(forest, target.sample = "all")

Output in Development Version

> average_treatment_effect(forest, target.sample = "all")
estimate  std.err 
     NaN      NaN 

Output in Release Version

> average_treatment_effect(forest, target.sample = "all")
 estimate   std.err 
-2058.140  1401.162 

GRF version development

swager commented 5 years ago

I tried running this on both the old and new versions of the package. First, on this dataset, the propensity scores get very close to 0 (they get down to 0.0005). So even though the old version gives a point estimate for the ATE, it also warns: "Estimated treatment propensities go as low as 0, which means that treatment effects for some controls may not be well identified..."

The difference with the release version is that the forest that estimates the propensity scores appears to be splitting more aggressively, and now returns some propensity estimates that are exactly (rather than very nearly) zero. So that's why the ATE estimate comes out as NaN (because we're dividing by a zero propensity).
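The arithmetic behind the NaN can be seen in a minimal base-R sketch. The function below is a generic AIPW-style correction term written for illustration only (it is not grf's actual code): the term has the propensity in its denominator, and for a control unit whose estimated propensity is exactly 0 the numerator is also 0, so R evaluates 0/0 = NaN.

```r
# Generic AIPW-style correction term (illustrative, not grf's code).
# For a control unit (w = 0) with e.hat exactly 0, the numerator
# (w - e.hat) and the denominator e.hat * (1 - e.hat) are both 0,
# and 0/0 evaluates to NaN in R.
aipw_term <- function(w, y, m.hat, e.hat) {
  (w - e.hat) / (e.hat * (1 - e.hat)) * (y - m.hat)
}

aipw_term(w = 0, y = 1.5, m.hat = 1.2, e.hat = 0.0005)  # finite
aipw_term(w = 0, y = 1.5, m.hat = 1.2, e.hat = 0)       # NaN
```

So a very small positive propensity still yields a finite (if noisy) estimate, while an exact zero propagates NaN through the average.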

I'm removing the "bug" label, as what the development version is doing still appears to be reasonable (it's just estimating a small propensity score by exactly zero instead of a very small positive number). However, it may be worth investigating further why the development version makes more splits when estimating the propensity score here.

p.s. In cases like these, with very small propensity scores, it's often a good idea to focus on the average treatment effect on the treated instead of the overall average treatment effect.

ginward commented 5 years ago

@swager Thanks for your reply! I guess what you mean by "release version" is actually the development version? It seems that only the old version (0.10.2) is released, and the one on GitHub is not released yet. (Correct me if I am wrong.)

I agree with you: when the data have propensities close to 0, it is better to use the treatment effect on the treated. What was puzzling me was the NaN, but I think what you said is right - it comes from dividing by a value that is almost 0.

It would indeed be interesting to find out why the new version splits more aggressively ...

swager commented 5 years ago

Thanks, I fixed the typo in my previous comment. I discussed with @jtibshirani, and it appears that the more aggressive splitting is due to a subtle interaction between early stopping rules for tree splitting and honesty. It's not clear to me whether such more aggressive splitting is better or worse in practice (my hunch is that slightly more aggressive splitting may in fact be good, in that it gives us more power to detect failures in overlap), so I'm closing this ticket for now. However, if anyone comes across adverse effects of the more aggressive splitting in the development version, please let me know.

ginward commented 5 years ago

@swager Just curious - because the ATE is documented as being computed from the following formula:

target.sample = "all": the ATE on the whole population, (1/n) * sum_{i = 1}^n E[Y(1) - Y(0) | X = X_i]. How is that "dividing by a 0 propensity"? My understanding is that, since there are now more propensities that are exactly 0, the 0s go into a denominator. But from this formula, it seems the propensities don't appear in any denominator at all.

ginward commented 5 years ago

Unless it is dividing by 0 in a step prior to calculating the average treatment effect ...

davidahirshberg commented 5 years ago

This formula defines the average treatment effect, but it cannot be used to calculate it, as it involves potential outcomes that we do not observe. For each i, we observe either Y_i(0) [if W_i=0] or Y_i(1) [if W_i=1] but not both. Our estimator does divide by the propensity score --- take a look at lines 183-210 of average_treatment_effect.R

ginward commented 5 years ago

@davidahirshberg Thanks! I will take a look.

ginward commented 5 years ago

@davidahirshberg if the estimator does divide by the propensity score, does it mean that the ATEs are by default doubly robust?

ginward commented 5 years ago

@swager are you reopening this issue? 😂

swager commented 5 years ago

Hi @ginward, I reopened the issue while there's still discussion on it. We only actively monitor open tickets.

For the question -- the function average_treatment_effect implements a doubly robust ATE estimator. When estimating ATEs with machine learning methods, you essentially always want to use a doubly robust (or related) method to avoid bias due to regularization (see, e.g., the double machine learning paper from Chernozhukov et al.) For a discussion of what we implement exactly, see https://arxiv.org/pdf/1902.07409.pdf.
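To make the doubly robust construction concrete, here is a self-contained toy example in base R using oracle nuisance functions. This is the generic AIPW score, not grf's implementation (grf estimates the nuisances with forests; see the paper linked above):

```r
set.seed(1)
n <- 2000
x <- runif(n)
e <- 0.2 + 0.6 * x            # true propensity, bounded away from 0 and 1
w <- rbinom(n, 1, e)
y <- x + 2 * w + rnorm(n)     # true treatment effect is 2

# Doubly robust (AIPW) score: outcome-model contrast plus
# inverse-propensity-weighted residual corrections.
m0 <- x                       # oracle E[Y | X, W = 0]
m1 <- x + 2                   # oracle E[Y | X, W = 1]
gamma <- (m1 - m0) +
  w * (y - m1) / e -
  (1 - w) * (y - m0) / (1 - e)
mean(gamma)                   # close to the true effect of 2
```

The score divides treated residuals by e and control residuals by 1 - e, which is exactly where an estimated propensity of zero produces NaN.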

ginward commented 5 years ago

Thanks @swager

I have indeed read the application paper you wrote, and at the time I thought double robustness was only implemented for the cluster-robust estimators.

But anyway - it looks like, when the fitted propensity is 0, the doubly robust method will not work (it basically blows up to +Inf). Is this how a doubly robust estimator is meant to behave? What about a person who really does have a treatment propensity close to 0?

swager commented 5 years ago

Yes that's right. If there are some people whose treatment propensity is (essentially) zero, then the ATE is (essentially) not identified (because the treatment effect function is not identified in the region with poor overlap). So that's why any non-parametric estimate of the ATE will blow up in this case.

The only way around this is to change estimands, so that the target of inference is still identified even when some propensities are 0. One simple way to do this is to use the average treatment effect on the treated. Because the ATT only cares about the CATE function in regions where some people are treated, it doesn't matter that the CATE function is not identified in the region with 0 propensity of treatment (because that region just gets ignored by the ATT anyways).
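A stylized numeric sketch of that point (toy values in base R; grf's actual ATT estimator is doubly robust, not a plain propensity-weighted average):

```r
x   <- (1:100 - 0.5) / 100                # covariate grid on (0, 1)
e   <- ifelse(x < 0.2, 0, 0.5)            # no one is treated below x = 0.2
tau <- ifelse(x < 0.2, NA, 1 + x)         # CATE unidentified without overlap

ate <- mean(tau)                            # NA: the ATE needs tau everywhere
att <- sum(e * tau, na.rm = TRUE) / sum(e)  # finite: weight is 0 where e = 0
```

The ATE comes out NA because it averages the CATE over the whole covariate space, while the ATT puts zero weight on the zero-propensity region and stays finite.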

ginward commented 5 years ago

@swager I see. This cleared things up a lot. Much appreciated!

jtibshirani commented 5 years ago

As a note, I walked back part of the change that caused forests to split more aggressively (#415). The behavior of the development version should now match that of 0.10.2 more closely.

I'm going to close this out, since it seems that this issue has been resolved. Thanks again @ginward for reporting the issue, it helped us catch an unintentional change in behavior (before we released it)!

ginward commented 5 years ago

@jtibshirani Sure thing!

minhengw commented 2 years ago

Dear all,

I have read all the posts above and learned a lot. Much appreciated.

My question is similar: the output of average_treatment_effect is [1] NaN. However, I am using an IV forest instead of a conventional causal forest, and the estimand in my case is the Average (Conditional) Local Average Treatment Effect (ACLATE):

average_treatment_effect(ivforest, target.sample = "all")

The warning message also indicated poor overlap, but I can't switch to estimating the effect on the treated because I am using an IV forest. The error message is as below:

Error in average_treatment_effect(ivforest, target.sample = "treated") : For any forest type other than causal_forest, the only implemented option is method=AIPW and target.sample=all

I am eager to know how to handle this situation, because the IV forest is essential to my research paper. Thank you very much!

Warm regards, David

swager commented 2 years ago

The first thing I'd do in a situation like this is to make a histogram of the estimated compliance scores, i.e., estimates of Delta(x) = P[W = 1 | Z = 1, X = x] - P[W = 1 | Z = 0, X = x]. Are there any regions of X-space where these are reliably away from 0? If yes, then focusing on those regions can help; but if no, then you have a weak-IV problem and using AIPW for the ACLATE is simply not going to work.
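One way to do that check without any grf-specific helper is to estimate the two arms of Delta(x) directly, e.g. with per-arm logistic regressions. This is an illustrative sketch on simulated data; the glm approach and all variable names are my own, not part of grf:

```r
set.seed(1)
n <- 4000
x <- runif(n)
z <- rbinom(n, 1, 0.5)                  # binary instrument
w <- rbinom(n, 1, 0.2 + 0.6 * x * z)    # true Delta(x) = 0.6 * x: weak near 0

# Estimate P[W = 1 | Z = 1, X] and P[W = 1 | Z = 0, X] separately,
# then take the difference to get compliance-score estimates.
fit1 <- glm(w ~ x, family = binomial, subset = z == 1)
fit0 <- glm(w ~ x, family = binomial, subset = z == 0)
delta.hat <- predict(fit1, newdata = data.frame(x = x), type = "response") -
             predict(fit0, newdata = data.frame(x = x), type = "response")

hist(delta.hat, main = "Estimated compliance scores")
```

A histogram piled up near 0 signals the weak-IV regime described above; mass reliably away from 0 suggests there are regions worth focusing on.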

Algorithmically, one thing you can do is to use the `subset` argument to restrict the analysis to regions of X-space where treatment effects are well identified. For example, in the case of ATE estimation under unconfoundedness, the following paper considers subsetting to regions of X-space where the propensity scores don't get too close to 0 or 1:

Crump, R. K., Hotz, V. J., Imbens, G. W., & Mitnik, O. A. (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1), 187-199.

The important thing to remember in doing so is that your subsetting should only be a function of X, not of Z / W / Y.
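A sketch of such a trimming rule on simulated propensities (the 0.1 / 0.9 cutoffs here are illustrative; Crump et al. derive a data-driven cutoff):

```r
set.seed(1)
n <- 1000
e.hat <- rbeta(n, 0.5, 5)            # skewed: many propensities near zero
keep  <- e.hat > 0.1 & e.hat < 0.9   # trimming band, a function of X only
mean(keep)                           # fraction of the sample retained
```

With a causal forest, a logical vector like `keep` (built, e.g., from the forest's stored propensity estimates, which are themselves a function of X) can then be passed as the `subset` argument of `average_treatment_effect()`; check the current documentation for whether `subset` is supported for your forest type.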

minhengw commented 2 years ago

Dear Dr. Swager,

Thanks! But I am not sure whether, in the case of an IV forest, I should use a specific range of Z.hat or of W.hat to define the subsample, and then pass it as subset in average_treatment_effect().

Many thanks. David