py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Identify subgroups. #84

Open linyingyang opened 5 years ago

linyingyang commented 5 years ago

Problem: I think there should be some way to identify the subgroups that suffer heterogeneous treatment effects; e.g., a tree structure could tell us that. It seems that SHAP targets only first-order (or maybe second-order) interactions, but what if we want more information, say higher-order interactions?

vasilismsr commented 4 years ago

I agree. Interpretability is a question that is on our plate, but we have yet to implement some basic approaches.

SHAP is a good first attempt at interpretability. It also offers feature importances, which can hint at which variables create heterogeneity and in which direction.

If your final model is a linear model you can simply inspect the coefficients in front of each variable.
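To illustrate the idea (a generic sklearn sketch, not econml's API; the feature matrix and effect values here are simulated stand-ins for the output of a fitted CATE model), regressing the estimated effects on the heterogeneity features and reading off the coefficients looks like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated stand-in for CATE estimates: true effect = X0 + 0.5, plus noise
rng = np.random.RandomState(0)
X = rng.binomial(1, .5, size=(5000, 10))
effects = X[:, 0] + .5 + rng.normal(0, .05, size=5000)

# With a linear final model, the coefficients directly describe heterogeneity:
# a large coefficient on X0 means X0 drives the effect, near-zero ones don't
lin = LinearRegression().fit(X, effects)
print(lin.intercept_, lin.coef_)
```

Here the coefficient on `X0` recovers roughly 1 and the rest stay near zero, which is exactly the heterogeneity pattern baked into the simulated effects.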

Another way to identify which subgroups are affected more is to post-process the final model: train a highly regularized classifier that predicts the sign of the heterogeneous treatment effect output by our CATE model, also weighting every sample by the magnitude of the effect. You can train a single tree for this and then print its structure. You could also print the mean CATE of the samples contained in each leaf of the tree to get an understanding of the magnitude of the CATE in each leaf.

This solution is similar to the final step of Algorithm 1 proposed in the causal boosting paper here: https://arxiv.org/pdf/1707.00102.pdf
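A minimal self-contained sketch of that recipe, using plain sklearn on simulated effect estimates (the DGP and names here are illustrative, not part of econml):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated CATE estimates: positive when X0 == 1, negative otherwise
rng = np.random.RandomState(0)
X = rng.binomial(1, .5, size=(2000, 10))
cate = np.where(X[:, 0] == 1, 1.0, -0.5) + rng.normal(0, .1, size=2000)

# Highly regularized classifier on the sign of the effect,
# weighting each sample by the magnitude of its effect
clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=100)
clf.fit(X, np.sign(cate), sample_weight=np.abs(cate))
print(export_text(clf, feature_names=[f"X{i}" for i in range(10)]))

# Mean CATE of the samples falling in each leaf, to gauge magnitudes
leaves = clf.apply(X)
for leaf in np.unique(leaves):
    print(f"leaf {leaf}: mean CATE = {cate[leaves == leaf].mean():.2f}")
```

The printed tree splits on `X0`, and the per-leaf means recover roughly +1.0 and -0.5, i.e. the two subgroups with opposite-signed effects.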

vasilismsr commented 4 years ago

Here is, for instance, one example of post-processing with a decision tree. Consider the following DGP:

import numpy as np

n = 10000   # samples
d = 100     # total features
d_x = 10    # features allowed to drive effect heterogeneity
X = np.random.binomial(1, .5, size=(n, d))
T = np.random.binomial(1, .5, size=(n,))
# True CATE is X[:, 0] + .5: the effect is .5 when X0 == 0 and 1.5 when X0 == 1
y = (X[:, 0] + .5)*T + X[:, 0] + np.random.normal(0, 1, size=(n,))

We can estimate the heterogeneous treatment effect with the linear double ml estimator as follows:

from econml.dml import LinearDMLCateEstimator
from sklearn.linear_model import LassoCV, LogisticRegressionCV

est = LinearDMLCateEstimator(model_y=LassoCV(cv=3),
                             model_t=LogisticRegressionCV(cv=3, solver='lbfgs'),
                             n_splits=6,
                             linear_first_stages=False,
                             discrete_treatment=True)
# First d_x columns are the heterogeneity features X; the rest are controls W
est.fit(y, T, X[:, :d_x], X[:, d_x:])

Then we can post-process the model to identify important subgroups by simply fitting a single regression tree on the effects predicted by the fitted model. We can also export the tree to graphviz for visualization:

from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Fit a shallow tree on the estimated effects to surface the important subgroups
reg = DecisionTreeRegressor(max_depth=3).fit(X, est.effect(X[:, :d_x]))
export_graphviz(reg, 'test.dot')

Then from the shell you can render the tree (if you don't have graphviz, install it with e.g. sudo apt install graphviz):

dot -Tpng test.dot -o test.png

This will produce an image of the fitted tree.

There you can see which variable is chosen at each split, as well as the mean treatment effect at each node (the "value" number). You can see that most of the heterogeneity is produced by splitting on variable X[0], and subsequent splits don't create much difference. This matches the DGP, where the true effect is X[:, 0] + .5, i.e. roughly .5 and 1.5 in the two X[0] subgroups.
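If graphviz is not available, sklearn can also print the tree structure as plain text via export_text. A self-contained sketch (using the DGP's true effect X0 + .5 as a stand-in for the estimated effects, so no fitted CATE model is needed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Stand-in for the estimated effects: the DGP's true CATE is X0 + 0.5
rng = np.random.RandomState(0)
X = rng.binomial(1, .5, size=(1000, 10))
effects = X[:, 0] + .5

reg = DecisionTreeRegressor(max_depth=3).fit(X, effects)
# "value" at each node is the mean effect of the samples reaching that node
print(export_text(reg, feature_names=[f"X{i}" for i in range(10)]))
```

The printed tree makes a single meaningful split on `X0`, with leaf values of .5 and 1.5, mirroring what the graphviz image shows.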