py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/
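For flavor, here is a minimal, self-contained sketch of that workflow on synthetic data (the variable names and data-generating process below are illustrative, not taken from the project docs):

```python
# Minimal sketch: estimate the effect of a continuous treatment t on an
# outcome y, controlling for confounders w, with effect heterogeneity in x.
import numpy as np
from econml.dml import LinearDML

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))            # heterogeneity feature
w = rng.normal(size=(1000, 3))            # confounders
t = w[:, 0] + rng.normal(size=1000)       # treatment depends on confounders
y = (1 + 0.5 * x[:, 0]) * t + w[:, 1] + rng.normal(size=1000)

est = LinearDML(discrete_treatment=False, random_state=0)
est.fit(y, t, X=x, W=w)
print(est.ate(x))         # average effect, should be near 1
print(est.effect(x[:5]))  # heterogeneous effects for the first few rows
```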

Help to interpret shap values and CATE #614

Closed · juandavidgutier closed this issue 2 years ago

juandavidgutier commented 2 years ago

Hello @kbattocchi,

I have a model with an X variable, Hesitant, and I can also get the shap_values for my model. The model uses PolynomialFeatures(degree=3). Similarly, I estimated the CATE for the variable Hesitant. These are the figures I obtain for the shap values and the CATE:

[Figure 1: SHAP beeswarm plot for the model; the single feature row is labeled X0]

[Figure 2: estimated CATE plotted against Hesitant]

My question is how to interpret the CATE results. The X0 row in the shap values figure shows that high values of Hesitant are more likely to produce high values of the output variable, but in the CATE figure, high values of Hesitant tend to reduce the effect of the treatment on the output variable.

I assume the difference can be explained by the PolynomialFeatures argument, but I would appreciate any details you can give.

By the way, is it possible to make my shap values figure show the name of the X variable?

I'd really appreciate your help.

Here is my dataset: dataset_covid.csv

and here is my code:

```python
import os, warnings, random

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import arviz as az
import shap

import dowhy
import econml
from dowhy import CausalModel
from econml.dml import DML, LinearDML, SparseLinearDML, NonParamDML, CausalForestDML
from econml.dr import DRLearner, ForestDRLearner, SparseLinearDRLearner
from econml.orf import DROrthoForest, DMLOrthoForest
from econml.inference import BootstrapInference
from econml.metalearners import TLearner, SLearner, XLearner, DomainAdaptationLearner
from econml.score import RScorer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
                              GradientBoostingRegressor, GradientBoostingClassifier)
from sklearn.model_selection import train_test_split
from zepid.graphics import EffectMeasurePlot
from zepid.causal.causalgraph import DirectedAcyclicGraph
from joblib import Parallel, delayed
from plotnine import ggplot, aes, geom_line, geom_ribbon, ggtitle, labs


def seed_everything(seed=123):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'


seed = 123
seed_everything(seed)
warnings.filterwarnings('ignore')
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# import data
data = pd.read_csv("D:/dataset_covid.csv", encoding='latin-1')
data = data.dropna()

Constrained = data[['output50', 'Constrained', 'SVI', 'Access',
                    'HC_Accessibility_Barriers', 'Hesitant',
                    'Sociodemographic_Barriers', 'PopDensity', 'broadband']]
print(Constrained.std())
Constrained.PopDensity = stats.zscore(Constrained.PopDensity)
Constrained.Hesitant = stats.zscore(Constrained.Hesitant)

Y = Constrained.output50.to_numpy()
T = Constrained.Constrained.to_numpy()
W = Constrained[['SVI', 'HC_Accessibility_Barriers', 'Access']].to_numpy().reshape(-1, 3)
X = Constrained[['Hesitant']].to_numpy().reshape(-1, 1)

X_train, X_val, T_train, T_val, Y_train, Y_val, W_train, W_val = train_test_split(
    X, T, Y, W, test_size=.4)

reg1 = lambda: GradientBoostingClassifier()
reg2 = lambda: GradientBoostingRegressor()

models = [
    ('ldml', LinearDML(model_y=reg1(), model_t=reg2(), discrete_treatment=False,
                       linear_first_stages=False, cv=3, random_state=123)),
    ('sldml', SparseLinearDML(model_y=reg1(), model_t=reg2(), discrete_treatment=False,
                              featurizer=PolynomialFeatures(degree=3, include_bias=False),
                              linear_first_stages=False, cv=3, random_state=123)),
    ('dml', DML(model_y=reg1(), model_t=reg2(), model_final=LassoCV(),
                discrete_treatment=False, featurizer=PolynomialFeatures(degree=3),
                linear_first_stages=False, cv=3, random_state=123)),
    # ('ortho', DMLOrthoForest(model_Y=reg1(), model_T=reg2(),
    #                          model_T_final=LassoCV(), model_Y_final=LassoCV(),
    #                          discrete_treatment=False, global_res_cv=3, random_state=123)),
    ('forest', CausalForestDML(model_y=reg1(), model_t=reg2(),
                               featurizer=PolynomialFeatures(degree=3),
                               discrete_treatment=False, cv=3, random_state=123)),
]


def fit_model(name, model):
    return name, model.fit(Y_train, T_train, X=X_train, W=W_train)


models = Parallel(n_jobs=-1, verbose=1, backend="threading")(
    delayed(fit_model)(name, mdl) for name, mdl in models)

# choose the model with the highest RScore
scorer = RScorer(model_y=reg1(), model_t=reg2(), discrete_treatment=False,
                 cv=3, mc_iters=3, mc_agg='median')
scorer.fit(Y_val, T_val, X=X_val, W=W_val)
rscore = [scorer.score(mdl) for _, mdl in models]
print(rscore)  # best model: SparseLinearDML

# Step 1: Modeling the causal mechanism
model_Constrained = CausalModel(
    data=Constrained,
    treatment=['Constrained'],
    outcome=['output50'],
    graph="""graph[directed 1
        node[id "Constrained" label "Constrained"]
        node[id "output50" label "output50"]
        node[id "SVI" label "SVI"]
        node[id "HC_Accessibility_Barriers" label "HC_Accessibility_Barriers"]
        node[id "Access" label "Access"]
        node[id "Hesitant" label "Hesitant"]
        edge[source "Access" target "SVI"]
        edge[source "Access" target "HC_Accessibility_Barriers"]
        edge[source "Access" target "Hesitant"]
        edge[source "Access" target "output50"]
        edge[source "Access" target "Constrained"]
        edge[source "SVI" target "Constrained"]
        edge[source "SVI" target "output50"]
        edge[source "HC_Accessibility_Barriers" target "Constrained"]
        edge[source "HC_Accessibility_Barriers" target "output50"]
        edge[source "SVI" target "HC_Accessibility_Barriers"]
        edge[source "SVI" target "Hesitant"]
        edge[source "HC_Accessibility_Barriers" target "Hesitant"]
        edge[source "Constrained" target "Hesitant"]
        edge[source "Constrained" target "output50"]
        edge[source "Hesitant" target "output50"]
    ]""")

# view model
model_Constrained.view_model()

# Step 2: Identifying effects
identified_estimand_Constrained = model_Constrained.identify_effect(
    proceed_when_unidentifiable=False)
print(identified_estimand_Constrained)

# Step 3: Estimating effects
estimate_Constrained = SparseLinearDML(model_y=reg1(), model_t=reg2(),
                                       discrete_treatment=False,
                                       featurizer=PolynomialFeatures(degree=3, include_bias=False),
                                       linear_first_stages=False, cv=3, random_state=123)
estimate_Constrained = estimate_Constrained.dowhy
estimate_Constrained.fit(Y=Y, T=T, X=X, W=W, inference='bootstrap')
estimate_Constrained.effect(X)

ate_Constrained = estimate_Constrained.ate(X)
print(ate_Constrained)
ci_Constrained = estimate_Constrained.ate_interval(X)
print(ci_Constrained)

# shap
shap_values = estimate_Constrained.shap_values(X)
x0 = 'Hesitancy'  # (unused)
shap.plots.beeswarm(shap_values['Y0']['T0'])
shap.summary_plot(shap_values['Y0']['T0'], plot_type="violin")

# CATE over the range of hesitancy
min_Hesitant = -2.45
max_Hesitant = 2.35
delta = (max_Hesitant - min_Hesitant) / 100
X_test = np.arange(min_Hesitant, max_Hesitant + delta - 0.001, delta).reshape(-1, 1)

treatment_effects = estimate_Constrained.const_marginal_effect(X_test)
te_lower, te_upper = estimate_Constrained.const_marginal_effect_interval(X_test)

est2_Constrained = SparseLinearDML(model_y=reg1(), model_t=reg2(),
                                   discrete_treatment=False,
                                   featurizer=PolynomialFeatures(degree=2, include_bias=False),
                                   linear_first_stages=False, cv=3, random_state=123)
est2_Constrained.fit(Y=Y, T=T, X=X, inference="bootstrap")

treatment_effects2 = est2_Constrained.effect(X_test)
te_lower2_cons, te_upper2_cons = est2_Constrained.effect_interval(X_test)

# plot elasticity (the rest of this plot was truncated in the original post)
(
    ggplot(aes(x=X_test.flatten(), y=treatment_effects2))
    # ...
)
```

kbattocchi commented 2 years ago

From your code, it looks like you're plotting the results of est2_Constrained, which uses a degree-2 polynomial rather than the degree-3 polynomial that corresponds to the shap results.
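To compare like with like, one option is to plot the CATE from the degree-3 estimator over the same grid. A minimal sketch reusing `estimate_Constrained` and `X_test` from the question's code (plotted with matplotlib here rather than plotnine):

```python
# Sketch: CATE curve from the degree-3 estimator, so the curve matches the
# featurization behind the shap values. Assumes estimate_Constrained and
# X_test are defined as in the question.
import matplotlib.pyplot as plt

cate3 = estimate_Constrained.const_marginal_effect(X_test).flatten()
lb3, ub3 = estimate_Constrained.const_marginal_effect_interval(X_test)

plt.plot(X_test.flatten(), cate3, label='degree-3 CATE')
plt.fill_between(X_test.flatten(), lb3.flatten(), ub3.flatten(), alpha=0.3)
plt.xlabel('Hesitant (z-score)')
plt.ylabel('Effect of Constrained on output50')
plt.legend()
plt.show()
```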

The plot basically says that if 'Hesitant' is 0, then a one-unit increase in T should cause roughly a 0.48-unit increase in Y; if 'Hesitant' is 2, then a one-unit increase in T should cause roughly a 0.1-unit decrease; and so on (I'm just reading the approximate values off the chart).
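That reading can be checked in code by evaluating the constant marginal effect at specific values of 'Hesitant'; a sketch reusing `est2_Constrained` from the question:

```python
# Sketch: query the CATE at Hesitant = 0 and Hesitant = 2 instead of
# eyeballing the chart (the values in the comment are the approximate
# ones read off the plot, not guaranteed outputs).
import numpy as np

pts = np.array([[0.0], [2.0]])
print(est2_Constrained.const_marginal_effect(pts))
# expected to be roughly [0.48, -0.1] given the plotted curve
```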

As a side note, you can probably get more accurate confidence intervals from the default "debiasedLasso" inference rather than explicitly using "bootstrap".
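A sketch of what that would look like, assuming the same variables as in the question (for SparseLinearDML, simply omitting the `inference` argument falls back to the analytic debiased-lasso inference):

```python
# Sketch: fit without inference='bootstrap'; SparseLinearDML then uses
# its debiased-lasso inference for confidence intervals.
est = SparseLinearDML(model_y=reg1(), model_t=reg2(), discrete_treatment=False,
                      featurizer=PolynomialFeatures(degree=3, include_bias=False),
                      linear_first_stages=False, cv=3, random_state=123)
est.fit(Y, T, X=X, W=W)
print(est.ate_interval(X))                           # ATE confidence interval
lb, ub = est.const_marginal_effect_interval(X_test)  # pointwise CATE bands
```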

I believe you should be able to pass the feature names through like this: `shap_values = estimate_Constrained.shap_values(X, feature_names=['Hesitant'])`
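A short usage sketch building on the calls already in the question:

```python
# Sketch: recompute the shap values with the feature name attached;
# the beeswarm row should then be labeled 'Hesitant' instead of 'X0'.
shap_values = estimate_Constrained.shap_values(X, feature_names=['Hesitant'])
shap.plots.beeswarm(shap_values['Y0']['T0'])
```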

juandavidgutier commented 2 years ago

OK @kbattocchi, thanks a lot for your help.

Regards