py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Double Machine learning for binary treatment variable #534

Closed ShaileeImrith closed 2 years ago

ShaileeImrith commented 2 years ago

Thank you for writing this package; it's been extremely useful. I was wondering whether it makes any difference if a classifier is used to predict D (model_d) instead of a regression model, since my treatment variable is binary (0/1). This would mean that residuals_d can take only three values (0, 1, -1). Is there any reason one should still use a regression model for D?

MasaAsami commented 2 years ago

I don't think that's a problem: if D is binary, the residuals will be "D (1 or 0) - estimated probability", which is a continuous variable, since self._model.predict_proba is used. https://github.com/microsoft/EconML/blob/master/econml/dml/dml.py

ShaileeImrith commented 2 years ago

Hi, thank you for your comment. I am still a bit confused, though. A classifier would predict binary values (0 or 1, like, say, sklearn's RandomForestClassifier), and therefore residuals_d would also take discrete values, not continuous ones. So my question is: is there anything conceptually wrong with residuals_d being discrete (taking values 0, 1, or -1)? Would the final OLS step then be a linear regression of the continuous residuals_y on the discrete residuals_d?

MasaAsami commented 2 years ago

Hi!

is there anything conceptually wrong with residuals_d being discrete (values can be 0, 1 or -1)? The final ols step then is a linear regression of the continuous residuals_y on the discrete residuals_d?

I think the output of the treatment model (classifier) should be a continuous variable (an assigned probability between 0 and 1) rather than a discrete one (1 or 0), since it should essentially work as a propensity score. If we leave the output as binary, we lose a lot of information about the allocation tendency.

Example: suppose T_i (observed) = 1.

Continuous output:
- When P[T_i|X_i] = 0.6, residual: 1 - 0.6 = 0.4
- When P[T_i|X_i] = 0.9, residual: 1 - 0.9 = 0.1

Binary output:
- When P[T_i|X_i] = 0.6, residual: 1 - 1 = 0
- When P[T_i|X_i] = 0.9, residual: 1 - 1 = 0
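To make the information-loss point concrete, here is a minimal sketch on synthetic data (all names and the logistic treatment model are illustrative, not from the thread) comparing residuals computed from predict_proba against residuals computed from the hard predict output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# treatment assignment depends on X, so the propensity varies across units
p_true = 1 / (1 + np.exp(-X[:, 0]))
T = rng.binomial(1, p_true)

model_t = LogisticRegression().fit(X, T)

# continuous residuals: T minus the predicted probability (a propensity score)
res_proba = T - model_t.predict_proba(X)[:, 1]

# discrete residuals: T minus the hard 0/1 class prediction
res_class = T - model_t.predict(X)

# the hard-prediction residuals collapse to at most three values (-1, 0, 1),
# discarding how confident the model was about each unit's assignment
print(len(np.unique(res_class)))  # at most 3 distinct values
print(len(np.unique(res_proba)))  # many distinct values
```

The continuous residuals preserve the gradation between a unit the model was barely sure about (residual near 0.4 in the example above) and one it was very sure about (residual near 0.1); the hard predictions erase that distinction.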

Also, if we make the treatment residuals discrete, there is a risk that their variance will be smaller than it needs to be, and if the variance of the treatment residuals is small, the variance of the final estimator will be large.

I made some sample code (parts of it are in Japanese, sorry). I got the data from here (https://rdrr.io/cran/DoubleML/src/R/datasets.R).

t_res = e401k (observed 1 or 0) - first_t_model.predict_proba (estimated allocation probability)

[Attached: screenshots of the sample code and the resulting residual plots]

kbattocchi commented 2 years ago

@ShaileeImrith If your treatment is discrete, then you should pass discrete_treatment=True to the DML initializer. Then we will expect the model_t to be a classification model, and as @MasaAsami says, we will use the predicted class probabilities (predict_proba output of the classifier) rather than the actual class prediction (predict of the classifier) when computing the residuals.

kbattocchi commented 2 years ago

But if your question is merely out of curiosity, it's not clear that it would be "wrong" to use the discrete residuals you'd get from treating a classifier as if it were a regressor, as long as the assumptions of the DoubleML model are satisfied. In general, though, you'd expect those residuals to have higher variance than if you were using the predicted probabilities, which would probably lead to a noisier estimate.
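The variance comparison can be checked directly on synthetic data. A quick sketch (the logistic treatment model and all names are illustrative): when the classifier's predicted probabilities are well calibrated, the hard-prediction residuals have second moment roughly E[min(p, 1-p)], which is always at least E[p(1-p)], the second moment of the probability residuals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
p = 1 / (1 + np.exp(-1.5 * X[:, 0]))
T = rng.binomial(1, p)

clf = LogisticRegression().fit(X, T)

# residualize T two ways: against predicted probabilities vs. hard predictions
res_proba = T - clf.predict_proba(X)[:, 1]
res_class = T - clf.predict(X)

# the hard-prediction residuals are noisier
print(res_proba.var(), res_class.var())
```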

ShaileeImrith commented 2 years ago

Okay, thank you for your reply. Much appreciated:)