py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Proper representation of estimated model in causal graph #744

Open carterrees opened 1 year ago

carterrees commented 1 year ago

I am working with the customer segmentation example and wanted to make sure I understand how the model would be represented in a DAG.

Specifically, the wording found here says "We assume we have data that are generated from some collection policy. In particular, we assume that we have data of the form: {Y_i(T_i), T_i, X_i, W_i, Z_i} where Y_i(T_i) is the observed outcome for the chosen treatment, T_i is the treatment, X_i are the co-variates used for heterogeneity, W_i are other observable co-variates that we believe are affecting the potential outcome Y_i(T_i) and potentially also the treatment T_i..."

The DAG shows the effect of price (T) on demand (Y), where the control variables (W) are adjusted for on the assumption that they block all back-door paths between Y and T and X.

Therefore, am I correct that when I run the code from the notebook, the model is properly represented by the DAG in the picture?

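from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import PolynomialFeatures

# log_Y, log_T, X, and W are the arrays prepared earlier in the notebook
est = LinearDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingRegressor(),
    featurizer=PolynomialFeatures(degree=2, include_bias=False),
)
est.fit(log_Y, log_T, X=X, W=W, inference="statsmodels")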

[Image: Screen Shot 2023-03-09 at 4 43 02 PM]

kbattocchi commented 1 year ago

Yes, that looks correct to me (although it's also possible that there are additional arrows among the Ws and Xs - we are agnostic to that). The key to correctness is that there are no unobserved variables (that is, variables that don't belong to W or X) that affect Y.
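
To make the "no unobserved variables" condition concrete, here is a minimal sketch in plain Python that enumerates back-door paths from T to Y and checks which ones remain open for a given adjustment set. The edge list is an assumption about what the pictured DAG contains (W and X each pointing into T and Y, plus T pointing into Y), and the helper names and the extra node U are hypothetical; none of this is EconML API.

```python
# Assumed edges for the DAG in this thread (not read from the screenshot).
EDGES = [("W", "T"), ("W", "Y"), ("X", "T"), ("X", "Y"), ("T", "Y")]


def children(node, edges):
    return {b for a, b in edges if a == node}


def descendants(node, edges):
    """All nodes reachable from `node` by following arrows."""
    seen, stack = set(), [node]
    while stack:
        for child in children(stack.pop(), edges):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def simple_paths(start, end, edges):
    """All simple paths between start and end, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        for nxt in neighbours.get(path[-1], ()):
            if nxt in path:
                continue
            if nxt == end:
                paths.append(path + [nxt])
            else:
                stack.append(path + [nxt])
    return paths


def is_blocked(path, adjust, edges):
    """Apply the d-separation rule to one path, given the adjustment set."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if collider:
            # A collider blocks unless it or one of its descendants is adjusted for.
            if mid not in adjust and not (descendants(mid, edges) & adjust):
                return True
        elif mid in adjust:
            # A chain or fork node blocks when it is adjusted for.
            return True
    return False


def open_backdoor_paths(treatment, outcome, adjust, edges):
    """Paths that start with an arrow into the treatment and are not blocked."""
    edge_set = set(edges)
    return [
        p for p in simple_paths(treatment, outcome, edges)
        if (p[1], p[0]) in edge_set and not is_blocked(p, adjust, edges)
    ]


# Adjusting for both W and X, as est.fit(..., X=X, W=W) does.
print(open_backdoor_paths("T", "Y", {"W", "X"}, EDGES))

# A hypothetical unobserved confounder U that is in neither W nor X.
EDGES_WITH_U = EDGES + [("U", "T"), ("U", "Y")]
print(open_backdoor_paths("T", "Y", {"W", "X"}, EDGES_WITH_U))
```

Under these assumed edges, the first call finds no open back-door path, while the second reports the path through U, which is the situation the condition above rules out.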

For most of our estimators (including DML), there are some additional assumptions on the functional form of the relationships between Y, T, and X, such as that Y is linear in (a featurization of) T, with a coefficient that depends only on X.
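
As a sketch of what this functional-form assumption looks like when written out, the partially linear model commonly used to describe DML is given below. The symbols \theta, g, f, \varepsilon, \eta and the feature map \phi are notation introduced here for illustration rather than taken from the comment; in the notebook code above, \phi would correspond to the degree-2 PolynomialFeatures featurizer.

```latex
\begin{aligned}
Y &= \theta(X)\, T + g(X, W) + \varepsilon, & \mathbb{E}[\varepsilon \mid X, W] &= 0, \\
T &= f(X, W) + \eta, & \mathbb{E}[\eta \mid X, W] &= 0, \\
\theta(X) &= \langle \theta, \phi(X) \rangle & &\text{(linear final stage, as in LinearDML)}.
\end{aligned}
```

Here g and f are the nuisance functions fitted by model_y and model_t, and \theta(X) is the heterogeneous effect of T on Y. The W variables enter only through g and f, which is why the effect coefficient depends only on X.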

carterrees-entrata commented 1 year ago

Thank you @kbattocchi. Understood about some of the assumptions as well. I'll have some more questions that I'll lay out in another thread related to this answer and Shapley values.