py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

How to get the Confidence Interval for ATE instead of CATE #870

Open ludovico-lanni opened 6 months ago

ludovico-lanni commented 6 months ago

Hello!

I have been using the EconML library for some time now, and I am not sure how to use a DML object to make inference about the ATE without conditioning the results on a given set of features X.

All the methods that I've seen in the docs, like ate() and ate_interval(), require an X whenever X was used in the fitting process. What they return is the CATE, conditioned on X. However, imagine I want to use DML methods to reduce the variance of a causal estimator on experimental data (where the treatment is randomised). Even though I certainly want to add interactions with a set of controls X and non-linearities in the model (which is why I would use lasso or non-parametric DML), I am only interested in one summarising number (the ATE) and its confidence interval.

I guess that I can get the ATE by taking the mean of the CATE calculated on my sample X. But what about the confidence interval of the ATE?

kbattocchi commented 5 months ago

For most of our estimators, estimation is centered on providing the CATE, and the mechanics of the estimation process do not automatically produce an ATE for the training population at the same time. We provide the ate, ate_interval, and ate_inference methods as a convenience: they compute the ATE over any population by taking the CATE estimates for that population and averaging them. ate_interval provides confidence intervals, and ate_inference provides not only confidence intervals but also p-values, etc. So you can get what you're after by passing the same set of Xs to ate_inference that you used when training the estimator, although this will not necessarily be a very precise estimate of the ATE.

One exception to this general rule is CausalForestDML, which does compute a doubly-robust estimate of the ATE as part of the estimation process (if drate=True, which it is by default). To access it, use the ate_ attribute or the ate__inference method (with an extra underscore compared to the standard method). This should give a more precise estimate with tighter confidence intervals compared to the approach that averages CATEs.