py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Bootstrap estimator for ORF #184

Open congcongruc opened 4 years ago

congcongruc commented 4 years ago

Hi! I'm using the bootstrap to derive confidence intervals for ORF. I notice that the bootstrap estimator relies on random sampling with replacement. This raises a concern for ORF (forest-based) estimation: because of sampling with replacement, a single observation is likely to be sampled more than once within one ORF estimation run. This means identical observations may be used for training the trees/forest/weights, and in particular identical observations will always land in the same leaf. This seems somewhat odd, though I don't have a clear prediction of how it will affect the estimates.
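To make the concern concrete, here is a quick check (plain numpy, unrelated to econml) showing that a with-replacement resample of size n contains only about 63% unique observations on average:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap resample: n indices drawn with replacement.
idx = rng.choice(n, size=n, replace=True)

# Fraction of distinct observations in the resample; the expectation
# is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ~ 0.632 as n grows.
print(len(np.unique(idx)) / n)  # ~0.632, so ~37% of rows are duplicates
```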

I checked other forest-based methods, such as random forests and the generalized random forest proposed by Athey et al. It seems that when growing trees repeatedly, random forests allow sampling either with or without replacement, while generalized random forests (causal forests) only allow sampling without replacement. For ORF, I would guess sampling without replacement is more suitable. Any suggestions on this? Thanks!

vasilismsr commented 4 years ago

This is a very interesting question! Here are my thoughts: 1) The problem you raise will most probably introduce a finite-sample error in the coverage of the bootstrap but will vanish asymptotically, so it will not have a big effect and bootstrap intervals will still be asymptotically nominal.

- The high-level intuition: I think the better view of the bootstrap is that, conditional on the empirical distribution (i.e. the original sample), each observation in a bootstrap resample should be thought of as an independent random variable drawn from the empirical distribution (as opposed to the true distribution). From this point of view the sample splitting, and the whole estimation process, is correct, since each observation is independent of the others, and your objection vanishes. These observations are correlated only through the empirical distribution (the original sample); but asymptotically, w.h.p. over the original sample, the empirical distribution is close to the population distribution, so even conditional on the original sample each observation in the bootstrap resample can be thought of as an independent draw from the population distribution.
- The technical intuition is the following: the bootstrap is a good procedure for estimating confidence intervals when your estimate is asymptotically linear (i.e. it asymptotically looks like an average of independent random terms plus lower-order terms; see the display after this list) and asymptotically normal. See e.g. Theorem 2.1 of this chapter: https://arxiv.org/pdf/1809.04016.pdf . That theorem concerns a linear quantity, but you could in principle adapt it to account for the error from the lower-order term. The ortho forest estimates, like the causal forest estimates, are asymptotically linear, where the leading term is what is referred to as the Hajek projection. See e.g. Equations 50 and 52 here: https://arxiv.org/pdf/1901.03719.pdf
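To spell out "asymptotically linear" (notation is mine, not taken from the references above): an estimate of a target theta_0 is asymptotically linear with influence function psi if

```latex
\hat{\theta}_n - \theta_0
  = \frac{1}{n}\sum_{i=1}^{n} \psi(Z_i) + o_p\!\left(n^{-1/2}\right),
\qquad \mathbb{E}[\psi(Z_i)] = 0,
```

so that \sqrt{n}(\hat{\theta}_n - \theta_0) is asymptotically normal by the CLT applied to the leading average, and the bootstrap consistently reproduces the distribution of that average.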

Potentially your observation points to an approach for a finite-sample correction, but most of the inference methods used here and in the literature have only asymptotic guarantees. In the bootstrap Monte Carlo experiment we ran in the ortho forest paper (see Figure 4 here: https://arxiv.org/pdf/1806.03467.pdf) we do get reasonable coverage (90% empirical coverage for a 98% confidence interval), though that does point to a finite-sample error (these experiments were not very large scale due to the computational intensity of the bootstrap in this case).

2) Subsampling should definitely be more robust in terms of the assumptions under which it works, and hence it should have better finite-sample properties, so you could try that out. We have plans to allow subsampling without replacement in our bootstrap method (see here: https://github.com/microsoft/EconML/projects/8; subsampling is one of the items there). For now you could just write your own for loop that draws half-samples and runs the ortho forest, as in the sketch below. Another bootstrap method in the works is bootstrapping only the final stage, which for the ortho forest would keep the forest structures fixed and solely perturb the weights given to each sample in the final stage based on bootstrap resampling. This would also bypass the issue you raise.
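A minimal sketch of that manual half-sample loop (the class name and the fit/effect signatures follow econml's DMLOrthoForest; substitute whichever ortho forest class and nuisance models you are actually using):

```python
import numpy as np
from econml.orf import DMLOrthoForest

def half_sample_effects(Y, T, X, W, X_test, n_reps=50, seed=0):
    """Subsampling without replacement: refit the ortho forest on
    random half-samples and collect the effect estimates on X_test."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    effects = []
    for _ in range(n_reps):
        idx = rng.choice(n, size=n // 2, replace=False)  # half-sample, no duplicates
        est = DMLOrthoForest(n_trees=200, min_leaf_size=10)
        est.fit(Y[idx], T[idx], X=X[idx], W=W[idx])
        effects.append(est.effect(X_test))
    effects = np.stack(effects)
    # Naive percentile interval across the half-sample estimates.
    # Note: this is conservative, since a half-sample estimate is noisier
    # than the full-sample one; formal subsampling inference rescales
    # the spread by sqrt(m/n) (here sqrt(1/2)).
    lower, upper = np.percentile(effects, [2.5, 97.5], axis=0)
    return effects.mean(axis=0), lower, upper
```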

3) Even better, within the next two weeks we expect to have confidence intervals for the ortho forest based on the bootstrap-of-little-bags idea in the paper of Athey and Wager (see here: https://github.com/microsoft/EconML/issues/104). I expect these to be more accurate in finite samples, and they will definitely be computationally less intensive. So you could just wait for this development if the arguments above for the bootstrap are not convincing.

congcongruc commented 4 years ago

Thanks!! Your intuition makes sense to me. I agree that this issue will not have a big effect on the results, so I will simply use the bootstrap with replacement at this stage. But if I have extra time after finishing my current project, I would like to try the alternative methods you mentioned. I also look forward to seeing the new developments on bootstrap methods!