py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Orthogonal/Double ML: Bayesian regression to estimate the treatment effect from residuals? #282

Open ghost opened 4 years ago

ghost commented 4 years ago

Hello,

As noted in the EconML documentation of Orthogonal/Double ML, this method performs the following two steps and then regresses #1's residuals on #2's residuals (see the sketch after this list):

1. predicting the outcome from the controls,

2. predicting the treatment from the controls;
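To make the recipe concrete, here is a minimal illustrative sketch of that residual-on-residual idea in plain scikit-learn, with cross-fitting done via cross_val_predict (the toy data and model choices here are my own, not EconML's internals):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# toy data: X = controls, T = treatment, Y = outcome with a true effect of 2.0
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
T = X[:, 0] + rng.normal(size=1000)
Y = 2.0 * T + X[:, 0] + rng.normal(size=1000)

# step 1: predict the outcome from the controls (out-of-fold, i.e. cross-fitted)
y_res = Y - cross_val_predict(RandomForestRegressor(), X, Y, cv=5)
# step 2: predict the treatment from the controls (cross-fitted)
t_res = T - cross_val_predict(RandomForestRegressor(), X, T, cv=5)
# final stage: regress step 1's residuals on step 2's residuals
theta_hat = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
print(theta_hat)  # recovers something close to the true effect of 2.0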

As the same documentation says, "The approach allows for arbitrary Machine Learning algorithms to be used for the two predictive tasks, while maintaining many favorable statistical properties related to the final model (e.g. small mean squared error, asymptotic normality, construction of confidence intervals)."

"The main advantage of DML is that if one makes parametric assumptions on πœƒ(𝑋), then one achieves fast estimation rates and, for many cases of final stage estimators, also asymptotic normality on the second stage estimate πœƒΜ‚ , even if the first stage estimates on π‘ž(𝑋,π‘Š) and 𝑓(𝑋,π‘Š) are only 𝑛1/4 consistent, in terms of RMSE. For this theorem to hold, the nuisance estimates need to be fitted in a cross-fitting manner (see _OrthoLearner). The latter robustness property follows from the fact that the moment equations that correspond to the final least squares estimation (i.e. the gradient of the squared loss), satisfy a Neyman orthogonality condition with respect to the nuisance parameters π‘ž,𝑓. For a more detailed exposition of how Neyman orthogonality leads to robustness we refer the reader to [Chernozhukov2016], [Mackey2017], [Nie2017], [Chernozhukov2017], [Chernozhukov2018], [Foster2019]."

In the "Class Hierarchy Structure" section, the documentation very nicely introduced "DMLCateEstimator". "DMLCateEstimator assumes that the effect model for each outcome 𝑖 and treatment 𝑗 is linear, i.e. takes the form πœƒπ‘–π‘—(𝑋)=βŸ¨πœƒπ‘–π‘—,πœ™(𝑋)⟩, and allows for any arbitrary scikit-learn linear estimator to be defined as the final stage (e.g. ElasticNet, Lasso, LinearRegression and their multi-task variations in the case where we have mulitple outcomes, i.e. π‘Œ is a vector)." I noticed that sklearn.linear_model contains Bayesian regression, such as the Bayesian Ridge regression.

Before calling sklearn.linear_model.BayesianRidge() as our last-stage residual regressor, I just want to make sure it is statistically sound to do so without violating any causal estimation assumption in the Double ML framework. Say our goal is to estimate the treatment effect, and we have a few historical randomized test results that measured this effect. We would like to use those randomized test results as informative priors in sklearn.linear_model.BayesianRidge() for the last-stage residual regression, so that we can interpret the posterior from that regression as the treatment effect derived from our prior knowledge, updated by the observed data.
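To be explicit, the shape of what I have in mind is below (a hedged sketch: DMLCateEstimator and its fit signature follow the EconML version discussed in this thread, and the first-stage gradient boosting models are just placeholders):

from econml.dml import DMLCateEstimator
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge

est = DMLCateEstimator(model_y=GradientBoostingRegressor(),  # placeholder first stage for Y
                       model_t=GradientBoostingRegressor(),  # placeholder first stage for T
                       model_final=BayesianRidge())          # the Bayesian final stage in question
est.fit(Y, T, X, W)                     # Y outcome, T treatment, X effect features, W controls
effects = est.const_marginal_effect(X)  # CATEs whose posterior we would like to interpret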

It has been a little difficult for me to locate a theoretical or academic reference on estimating treatment effects in the Double ML framework with a Bayesian approach. It would be really nice if you could point me to any such reference.

Thank you!

vsyrgkanis commented 4 years ago

Most DML theory is frequentist, but you can certainly use the Bayesian regression from sklearn as a first stage estimator. You can use cross validation to select among multiple models, and if Bayesian regression gets the best out-of-sample MSE you can go with it (e.g. using a GridSearchCV, or a meta-estimator that chooses among multiple models via cross validation at fit time). See e.g. the GridSearchCVList we wrote here: https://github.com/microsoft/EconML/blob/master/notebooks/ForestLearners%20Basic%20Example.ipynb
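For instance, a hedged sketch of that "pick the first stage by out-of-sample MSE" pattern with stock scikit-learn (the candidate models here are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import BayesianRidge, LassoCV
from sklearn.ensemble import RandomForestRegressor

# a one-step pipeline whose step gets swapped among candidate regressors by the grid search
pipe = Pipeline([('model', BayesianRidge())])
model_y = GridSearchCV(pipe,
                       param_grid={'model': [BayesianRidge(), LassoCV(), RandomForestRegressor()]},
                       scoring='neg_mean_squared_error', cv=5)
# model_y now behaves like a single sklearn regressor that, at fit time, keeps
# whichever candidate achieved the best out-of-sample MSE; pass it as a first stage.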

Maybe there are also "frequentist" MSE properties of the Bayesian regression methods implemented by sklearn, which would give further justification for using them, since such MSE guarantees are the only thing the DML theory requires. Though you can sort of bypass the need for such guarantees by doing cross validation among multiple models and choosing the best. You can even use AutoML, which will do this for you in an automated manner. See our notebook here for how to do this: https://github.com/microsoft/EconML/blob/master/notebooks/AutomatedML/Automated%20Machine%20Learning%20For%20EconML.ipynb

ghost commented 4 years ago

@vsyrgkanis Makes sense to me, THANKS a lot!! Yes, I will probably try Bayesian regression in the first-stage Y|X,W predictive task, to incorporate prior beliefs about the treatment's effect on the outcome Y that come from previous randomized experiments, and then use GridSearchCV to find the best model choice / best parameters.

ghost commented 3 years ago

@vsyrgkanis I realized there could be some concern when our outcome variable Y is a binary variable.

I read this issue about using Double ML with a binary outcome: https://github.com/microsoft/EconML/issues/204, and have a few questions:

  1. You mentioned that you and the team derived a Double ML version for binary outcomes, as specified in this paper (https://arxiv.org/pdf/1806.04823.pdf). You also pointed to a script implementing that binary-outcome Double ML method: https://github.com/vsyrgkanis/plugin_regularized_estimation/blob/1bcbad4803b2b7834477ab39051d40f7758c408b/logistic_te.py#L104

However, you mentioned "Since that method is not yet implemented and has not been stress tested I'm not sure it's the best option but you can try."

Is this script still at a beta stage, i.e. are we not advised to use it? Or can we now tweak the binary-outcome Double ML code you shared? E.g., would it be a good option for me to follow the repository for your logistic + Double ML paper? https://github.com/vsyrgkanis/plugin_regularized_estimation/tree/1bcbad4803b2b7834477ab39051d40f7758c408b

In your repo's README, I looked at the section "ORTHOPY library", where the class LogisticWithOffsetAndGradientCorrection() looks promising: it "is an estimator adhering to the fit and predict specification of sklearn that enables fitting an 'orthogonal' logistic regression". Can I then specify it in the Orthogonal/Double ML method like the following?


from econml.dml import DMLCateEstimator
from sklearn.ensemble import GradientBoostingClassifier  # placeholder sklearn classifier
from sklearn.linear_model import LinearRegression        # placeholder sklearn linear regressor
est = DMLCateEstimator(model_y=LogisticWithOffsetAndGradientCorrection(),  # from the ORTHOPY library
                       model_t=GradientBoostingClassifier(),
                       model_final=LinearRegression())

If the above looks reasonable, what exactly would the last-stage "residual of Y ~ residual of treatment" regression look like? Would it also be a logistic regression?

  2. What are the theoretical risks of using the easier RegWrapper() solution from the utility script for a binary outcome (my reading of that pattern is sketched below)? You showed a simulation example, and the causal results there looked okay. But from a theoretical perspective, are there situations that could severely bias the causal estimate when calling RegWrapper() on a binary outcome?
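For reference, my understanding of what a RegWrapper-style adapter does, as a paraphrased sketch (not the exact utility from the linked script):

from sklearn.base import BaseEstimator

class RegWrapperSketch(BaseEstimator):
    """Expose a classifier's P(Y=1 | features) through a regressor-like predict()."""
    def __init__(self, clf):
        self.clf = clf
    def fit(self, X, y, **fit_params):
        self.clf.fit(X, y, **fit_params)
        return self
    def predict(self, X):
        # return the predicted probability of class 1, so that residuals
        # Y - E[Y|X,W] are well defined for the residual-on-residual stage
        return self.clf.predict_proba(X)[:, 1]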

  3. Alternatively, I saw sklearn.linear_model.RidgeClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier), which "first converts the target values into {-1, 1} and then treats the problem as a regression task (multi-output regression in the multiclass case)." Is this generally applicable to the Double ML setup? I.e., for a binary outcome, is there any risk in reformulating the outcome variable Y as a {-1, 1} label and then applying a continuous-outcome regression for the stage-1 regression, as in the small example below?
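Concretely, the recoding I mean (illustrative only; the helper name is mine):

import numpy as np
from sklearn.linear_model import Ridge

def fit_signed_outcome_model(X, y_binary):
    # {0, 1} -> {-1, +1}, mirroring what RidgeClassifier does internally,
    # then fit an ordinary ridge regression as the stage-1 outcome model
    y_signed = 2 * np.asarray(y_binary) - 1
    return Ridge().fit(X, y_signed)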

  4. I added a question to that binary-outcome thread: https://github.com/microsoft/EconML/issues/204. Sorry, I know it's a closed issue already, but I thought it could be relevant. Does the Doubly Robust estimator have the same concern? I.e., what is the best practice if we want to use econml.drlearner for a binary outcome, e.g. would the setup sketched below be the right shape? Would a logistic regression in the first-stage model_y violate the linearity assumption? I found these tutorial slides on specifying a doubly robust estimator for logistic regression: https://www4.stat.ncsu.edu/~davidian/double.pdf. Beyond those slides, however, most papers I found on doubly robust estimation assume a continuous outcome variable Y, and in the Microsoft EconML user guide for the Doubly Robust Learner I can't find anywhere to specify a binary outcome variable: https://econml.azurewebsites.net/spec/estimation/dr.html. So I would appreciate learning more about that.
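To make question 4 concrete, this is the kind of setup I am asking about (a hedged sketch: the DRLearner constructor arguments follow my reading of the EconML docs, and I am reusing the RegWrapperSketch idea from question 2 for the binary outcome model):

from econml.drlearner import DRLearner
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegressionCV

est = DRLearner(model_propensity=GradientBoostingClassifier(),              # models P(T | X, W)
                model_regression=RegWrapperSketch(LogisticRegressionCV()),  # E[Y | T, X, W] for binary Y
                model_final=LinearRegression())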

Thank you!!