[Closed] JREED1234 closed this issue 6 years ago.
Interesting. Is the treatment W observational or randomized? Are there some regions of the data where effectively everyone (or no one) gets treated?
Some things to try that may help:
- Plot a histogram of the estimated propensity scores, e.g.:

  ```r
  tau.forest = causal_forest(X, Y, W)
  hist(tau.forest$W.hat)
  ```

- What happens if you use target.sample = "treated" in estimate_average_effect?

Thank you for your quick response. The treatment W is observational, based on characteristics observed by the firm when deciding who receives which treatment. There are some regions where the count of treated vs. untreated units is low, but I don't believe any region is completely empty.
Right now I am training with 2,000 trees, the default amount. It currently takes approximately an hour and a half to produce the results (NaNs for the ATE and ATT; the ATE of the control group does show a result and has never been NaN).
Looking at your comments above, I believe I am running into polarized propensity scores (i.e., a large frequency of 0's and 1's and not much in the middle). I have attached the requested histogram to this message.
I will do what I can to come up with some sort of simulated data, provided we can circumvent the proprietary nature of the real data. If you have any suggestions or thoughts in the interim, I would appreciate them.
Got it. From a scientific point of view, it's not clear how robustly the treatment effects are identified for those subjects whose treatment propensity is very close to 0 or 1, as identification in the potential outcomes model relies heavily on overlap (i.e., propensities being bounded away from 0 and 1).
There are two popular approaches for dealing with this issue:
You could try filtering the data, and only estimating the ATE over the subgroup whose propensities are within a reasonable range (e.g., 0.05 <= W.hat <= 0.95). Concretely, you could (1) run a regression forest of W on X, (2) only keep those observations that satisfy the inequalities, and (3) re-run a causal forest on that subset. This idea was proposed by Crump et al. (2009).
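A minimal sketch of that filtering recipe (the function names are grf's; the 0.05/0.95 cutoffs are just the illustrative thresholds mentioned above):

```r
library(grf)

# (1) Estimate propensities with a regression forest of W on X.
w.forest = regression_forest(X, W)
W.hat = predict(w.forest)$predictions  # out-of-bag propensity estimates

# (2) Keep only observations whose propensities lie in a reasonable range.
keep = W.hat >= 0.05 & W.hat <= 0.95

# (3) Re-run the causal forest on the overlapping subset.
tau.forest.sub = causal_forest(X[keep, ], Y[keep], W[keep])
estimate_average_effect(tau.forest.sub)
```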
You could estimate an "overlap-weighted ATE" in the sense of Li et al. (2017); this effectively weights the estimand towards parts of covariate space where treatment effects are identified. Procedurally, this amounts to first training a causal forest on all the data, and then estimating the WATE by running OLS of Y - Y.hat on W - W.hat (sometimes called a residual-on-residual regression). Note that if tau(x) were constant in x, this would coincide with the classical estimator of Robinson (1988) for the treatment effect.
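And a sketch of that residual-on-residual regression (this assumes the forest object exposes Y.hat alongside the W.hat field used above, as grf's causal_forest does; the slope on the residualized treatment is the Robinson-style point estimate):

```r
# Train on all the data.
tau.forest = causal_forest(X, Y, W)

# Residualize outcome and treatment on the forest's own fits.
Y.resid = Y - tau.forest$Y.hat
W.resid = W - tau.forest$W.hat

# OLS of Y-residuals on W-residuals; the W.resid coefficient
# is the overlap-weighted ATE.
wate.fit = lm(Y.resid ~ W.resid)
summary(wate.fit)$coefficients["W.resid", ]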
Given a large enough number of trees, the NaN issue should go away, but the confidence intervals may still be very wide, because estimating an average effect involves dividing by the estimated propensities (or by 1 minus the propensities).
In more detail: my guess is that you're getting NaNs because there are some parts of feature space with no variation at all in W_i across all the trained trees, so the forest cannot estimate tau(x) there. With more trees, it may be possible to find a tiny bit of variation in W_i, but then the numerical issue morphs into a statistical one: there is no longer a division by a numerical 0, but there is still a division by what amounts to a statistical 0.
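To see where that division problem lives in your data, a small diagnostic sketch (assuming tau.forest is the causal forest trained above; the 0.01/0.99 cutoffs are arbitrary, and the factor shown is roughly the inverse-propensity weight appearing in doubly robust ATE scores):

```r
e.hat = tau.forest$W.hat

# Fraction of the sample sitting at effectively deterministic propensities.
mean(e.hat < 0.01 | e.hat > 0.99)

# The inverse-propensity factor in the ATE score; it explodes as e.hat
# approaches 0 or 1, which is what drives the huge variance.
ipw.factor = 1 / (e.hat * (1 - e.hat))
summary(ipw.factor)
```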
References:

Crump, R. K., Hotz, V. J., Imbens, G. W., & Mitnik, O. A. (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1), 187-199.

Li, F., Morgan, K. L., & Zaslavsky, A. M. (2017). Balancing covariates via propensity score weighting. Journal of the American Statistical Association, 1-11.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56(4), 931-954.
Thank you so much for your thoughtful and detailed response. This has certainly given me a new direction to take the results I have obtained. I appreciate the promptness and detail in helping me out with this.
Great! Also, in case it's useful, I added a function that does the overlap-weighted estimator.
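For later readers, a usage sketch; the exact function name at the time of this thread may differ, but in current grf releases the overlap-weighted estimator is exposed through the target.sample argument:

```r
# Overlap-weighted ATE via the built-in estimator (current grf API).
average_treatment_effect(tau.forest, target.sample = "overlap")
```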
Hello:
I am working on a project with a large number of observations (roughly 300,000). There are approximately 20 explanatory variables that are useful in determining an individual's treatment (W). I have tried to use the causal_forest and estimate_average_effect functions, and sometimes they return NaN.
While I cannot go into the nuances of the data too much, approximately 30% of the sample receives the treatment. The outcome Y is binary, so we are looking at the change in the probability of the behavior due to the treatment. The overall rate of Y = 1 (treatment and control groups combined) is around 14%, so a "positive" outcome is quite rare across the entire sample. Do you have any suggestions on how I might avoid the NaN? As mentioned, sometimes I get an estimated treatment effect and other times NaN is returned. Should I increase the sample fraction? The minimum node size?
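(For reference, a sketch of how those tuning knobs are passed; num.trees, sample.fraction, and min.node.size are standard causal_forest arguments, and the values below are purely illustrative, not recommendations:)

```r
tau.forest = causal_forest(
  X, Y, W,
  num.trees = 4000,       # more trees make empty-cell NaNs less likely
  sample.fraction = 0.5,  # fraction of the data drawn for each tree
  min.node.size = 5       # minimum number of observations per leaf
)
```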
My thought is that there must be a tree in the forest causing the NaN issue (since on some runs I do get a numerical response rather than NaN). Is there a way to view each tree in isolation and drop any that produce an NaN?
Any suggestion would be appreciated. Thank you for your time.