CausalForestDML SHAP values not working properly

py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.

https://www.microsoft.com/en-us/research/project/alice/


CausalForestDML SHAP values not working properly #464

Open jbel1026 opened 3 years ago

jbel1026 commented 3 years ago

Hello,

I am using econml version 0.10.0 and shap v. 0.39.0

When I calculate SHAP values for CausalForestDML estimator the results do not make sense, as opposed to when I calculate them the same way on other estimators like DRL/Meta-learners.

Here is a brief description of my issue:

Y = ['event_flag']
T = ['Diabetes_A1c_Test_compliance_category_Non-Compliant']

For non-CausalForestDML models I can access my SHAP values with shap.plots.bar(shap_values[Y[0]][T[0]+'_1.0']). For CausalForestDML, however, the SHAP values are keyed differently, even though everything else has been set up the same way:

shap.plots.bar(shap_values['Y0']['T0_1.0'])
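To check which key names a given estimator actually produced, it can help to list them before plotting. The sketch below assumes the result is a nested dict keyed first by outcome and then by treatment, as the two calls above suggest. The helper name list_shap_keys is hypothetical, and the dict here is only a stand-in for the real return value.

```python
def list_shap_keys(shap_values):
    """Return every (outcome, treatment) key pair in a nested SHAP result dict."""
    return [
        (outcome, treatment)
        for outcome, by_treatment in shap_values.items()
        for treatment in by_treatment
    ]


# Stand-in for the dict returned by est.shap_values(X); the real leaf values
# would be SHAP explanation objects rather than None.
example = {"Y0": {"T0_1.0": None}}
print(list_shap_keys(example))
```

Running this on the output of each estimator shows whether the keys use the generic names ('Y0', 'T0_1.0') or the column names.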

Additionally, the results don't make sense:

For example, for DRL I see about what I expect: (clear directional relationships between the impact and the value of the feature, for the most part)

But for DML (built the same way, only difference is the estimator used), the results are not at all what I would expect: (very little directional relationships)

Thanks!

vsyrgkanis commented 3 years ago

Thanks @jbel1026! This looks like something we should definitely look into!

If you could send us any fake data that reproduces the problem, that would be fantastic for understanding why this might be happening; otherwise we have no way to replicate it.

It would also help if you could paste the snippets of code that generated these plots, especially the second plot.

Best, V

vsyrgkanis commented 3 years ago

Try econml 0.11 too, just in case (we made some small changes in how we call shap), but I don't think that would fix the problem.

jbel1026 commented 3 years ago

@vsyrgkanis I believe that the update to 0.11 solved the issue! The shap_values now contain the outcome and treatment variable names. The summary plot still looks the same, though. I am digging deeper to see whether it is actually showing the appropriate values.

jbel1026 commented 3 years ago

@vsyrgkanis OK, after looking into it more: the update fixed how the SHAP values are named, but the values are still incorrect.

I am working on getting some dummy data, but in the meantime I can show you how I am fitting the model and calling the SHAP values:

import shap
from econml.dml import CausalForestDML

est_xx = CausalForestDML(model_y=model_y_xx, model_t=model_t_xx, discrete_treatment=True, cv=5, n_estimators=500, max_features='sqrt', inference=True, random_state=123, verbose=0)
est_xx_fit = est_xx.dowhy.fit(Y=train[Y], T=train[T], X=train[X], W=train[W], cache_values=True)
shap_values = est_xx_fit.shap_values(modeling_df[X])
shap.summary_plot(shap_values[Y[0]][T[0] + '_1.0'])

That last shap summary plot outputs this:

Which I think is wrong, because, when I look at DCSI Score Vs. CATE, there is a very strong positive relationship, that is not represented in the SHAP plot:

vsyrgkanis commented 3 years ago

As a first step, try not using the dowhy wrapper, just in case the problem is there:

est_xx = CausalForestDML(model_y=model_y_xx, model_t=model_t_xx, discrete_treatment=True, cv=5, n_estimators=500, max_features='sqrt', inference=True, random_state=123, verbose=0)
est_xx_fit = est_xx.fit(Y=train[Y], T=train[T], X=train[X], W=train[W], cache_values=True)
shap_values = est_xx_fit.shap_values(modeling_df[X])
shap.summary_plot(shap_values[Y[0]][T[0] + '_1'])

As a second step, try a larger min_samples_leaf, e.g. min_samples_leaf=20. You might be hitting ill-posedness caused by low treatment variance within leaves.

As a third step, try calling tune on the forest before fitting. That might choose much better hyperparameters for your dataset:

est_xx = CausalForestDML(model_y=model_y_xx, model_t=model_t_xx, discrete_treatment=True, cv=5, n_estimators=500, max_features='sqrt', inference=True, random_state=123, verbose=0)
est_xx.tune(Y=train[Y], T=train[T], X=train[X], W=train[W])
est_xx_fit = est_xx.fit(Y=train[Y], T=train[T], X=train[X], W=train[W], cache_values=True)
shap_values = est_xx_fit.shap_values(modeling_df[X])
shap.summary_plot(shap_values[Y[0]][T[0] + '_1'])

jbel1026 commented 3 years ago

@vsyrgkanis Thanks so much for these tips. After running the suggestions, the results were still the same.

vsyrgkanis commented 3 years ago

Maybe try calculating SHAP values on a subsample of train_df rather than on a new sample. I'm not sure why that would make any difference, but try it out.

vsyrgkanis commented 3 years ago

Another thing to check: can you run our notebooks that use CausalForestDML together with SHAP values and see whether you get the same results as the executed notebooks posted in the git repo?

vsyrgkanis commented 3 years ago

Also try setting these parameters: min_var_fraction_leaf=0.1, min_var_leaf_on_val=True

vsyrgkanis commented 3 years ago

Finally, also try more trees together with the above options, i.e. n_estimators=4000, min_var_fraction_leaf=0.1, min_var_leaf_on_val=True

Trying these out will help us identify what the source of this behavior is.
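Put together, the suggested options could be collected roughly as follows. This is a sketch, not a recommended configuration: the helper names stabilized_forest_kwargs and make_estimator are hypothetical, and the remaining arguments simply mirror the snippets earlier in this thread.

```python
def stabilized_forest_kwargs(n_estimators=4000):
    """Options suggested in this thread to keep per-leaf treatment variation away from zero."""
    return {
        "n_estimators": n_estimators,
        "min_var_fraction_leaf": 0.1,   # minimum treatment variation enforced while splitting
        "min_var_leaf_on_val": True,    # enforce it on the estimation half-sample as well
        "discrete_treatment": True,
        "cv": 5,
        "random_state": 123,
    }


def make_estimator(model_y, model_t, **overrides):
    """Build a CausalForestDML with the options above; overrides replace individual values."""
    # Imported lazily so the kwargs helper can be inspected without econml installed.
    from econml.dml import CausalForestDML

    kwargs = {**stabilized_forest_kwargs(), **overrides}
    return CausalForestDML(model_y=model_y, model_t=model_t, **kwargs)
```

For a cheaper run, something like make_estimator(model_y_xx, model_t_xx, n_estimators=1000) lowers the tree count while keeping the two variance options.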

jbel1026 commented 3 years ago

@vsyrgkanis This looks like it worked! Here is the code I used. You will notice a couple of changes: I reduced the number of trees for computational reasons, and I tried with and without the dowhy wrapper; the results were the same.

vsyrgkanis commented 3 years ago

Based on this, here is the most probable reason: SHAP values for causal forests are only approximate. The prediction of a causal forest first calculates the conditional average Jacobian of the moment by averaging across the trees, and then inverts it to produce the prediction.

SHAP values, by contrast, explain the prediction of each tree (for computational reasons) and are then averaged across trees. If the trees are deep, some leaves can end up with little variation in the treatment on the estimation half-sample of an honest tree. In that case the tree-level Jacobian can be ill-posed and the prediction of a single tree can be unstable. The SHAP values for some trees can then behave erratically, and averaging SHAP values across trees might not help much.

The option min_var_fraction_leaf ensures a minimum treatment variation in the leaves while splitting, and setting min_var_leaf_on_val to True ensures that this minimum variation also holds on the estimation sample. This keeps the Jacobian of a single tree from being ill-posed.

Tuning the forest could also choose the appropriate depth for best prediction, and might have chosen shallower trees than the default, which also helps with the ill-posedness of the tree Jacobians.

Both of these help to reduce the approximation error of the calculated SHAP values.
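A toy scalar illustration of the gap described above, with made-up numbers rather than anything produced by econml: the forest inverts the averaged Jacobian, while a per-tree explanation effectively averages per-tree ratios. One tree with almost no treatment variation is enough to pull the two apart. The function names are hypothetical, and in the real estimator the Jacobian is a matrix, not a scalar.

```python
def forest_estimate(jacobians, moments):
    """Average the Jacobians and moments across trees first, then invert once."""
    return sum(moments) / sum(jacobians)


def mean_of_tree_estimates(jacobians, moments):
    """Invert each tree's Jacobian separately, then average the per-tree predictions."""
    per_tree = [m / j for j, m in zip(jacobians, moments)]
    return sum(per_tree) / len(per_tree)


# Four hypothetical trees with a true effect of about 2. The last tree has a
# leaf with almost no treatment variation, so its Jacobian is close to zero.
jacobians = [1.0, 0.9, 1.1, 0.01]
moments = [2.05, 1.75, 2.22, 0.12]

print(round(forest_estimate(jacobians, moments), 2))         # about 2.04
print(round(mean_of_tree_estimates(jacobians, moments), 2))  # about 4.5
```

The forest-level estimate stays near 2 because the unstable tree contributes little to the sums, whereas the average of per-tree ratios is dominated by that tree's ratio of 12.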

Most probably we need to add some disclaimers along these lines to the docs, and perhaps warnings when someone computes SHAP values for a causal forest.

jbel1026 commented 3 years ago

@vsyrgkanis Awesome, thanks so much for the explanation!

jbel1026 commented 3 years ago

@vsyrgkanis On a side note, why is there no tune option for the other forest-based estimators, such as ForestDRLearner?

vsyrgkanis commented 3 years ago

@jbel1026 Great point! There should be. One point that needs some care, though in the end it will be OK, is that the current tuning in CausalForestDML is done using the RScorer, which is also roughly what CausalForestDML is trying to optimize.

But there might be a small discrepancy between the RScorer and the DRLearner, which optimizes a different loss. Maybe we first need to implement a DRScorer (which would use the doubly robust loss to score an estimator) and use that in the tuning of ForestDRLearner.

Alternatively, we could use the RScorer for all estimator tuning, as the RScorer is a universal quality score for all estimators.

But you are right: all forest estimators should be tunable.
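For readers unfamiliar with the idea behind an R-score, here is a minimal sketch of the residual-on-residual loss it is built on. It is not econml's RScorer implementation: it assumes the outcome and treatment residuals have already been computed by first-stage models, and the function names r_loss and r_score are hypothetical.

```python
def r_loss(y_res, t_res, cate):
    """Mean squared error of predicting outcome residuals as cate * treatment residual."""
    return sum((y - c * t) ** 2 for y, t, c in zip(y_res, t_res, cate)) / len(y_res)


def r_score(y_res, t_res, cate):
    """Relative improvement over the best constant effect (1 = perfect, 0 = no better)."""
    const = sum(y * t for y, t in zip(y_res, t_res)) / sum(t * t for t in t_res)
    base = r_loss(y_res, t_res, [const] * len(y_res))
    return 1.0 - r_loss(y_res, t_res, cate) / base


# Toy residuals generated from heterogeneous effects of 1 and 3 (made-up data).
t_res = [1.0, -1.0, 1.0, -1.0]
y_res = [1.0, -1.0, 3.0, -3.0]

print(r_score(y_res, t_res, [1.0, 1.0, 3.0, 3.0]))  # CATE model that matches the data
print(r_score(y_res, t_res, [2.0, 2.0, 2.0, 2.0]))  # constant-effect model
```

Because the score only needs residuals and CATE predictions, it can in principle be applied to any CATE estimator, which is the sense in which it could serve all estimator tuning.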

jbel1026 commented 3 years ago

@vsyrgkanis ok makes sense, thanks!

gcasamat commented 3 years ago

I have a related question. Suppose that, for the reasons explained in this issue, I don't "trust" the results when SHAP is applied to CausalForestDML. Does it make sense to build a decision tree that explains the CATEs predicted by CausalForestDML and then apply SHAP to this decision tree?

vsyrgkanis commented 3 years ago

See our CATE tree interpreters.

They return exactly such a tree.

But yes, doing it outside of the package and interpreting with SHAP makes sense too.

You can then even fit a forest instead of just a tree (a single tree is what our CATE interpreters fit).
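Done outside the package, the surrogate approach might look like the sketch below. It assumes scikit-learn and NumPy are available and uses synthetic numbers in place of real CATE predictions (in practice something like est.effect(X)). The helper name fit_surrogate_tree and the depth of 3 are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_surrogate_tree(X, cate_predictions, max_depth=3, random_state=0):
    """Fit a shallow regression tree that mimics the predicted CATEs."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=random_state)
    tree.fit(X, cate_predictions)
    return tree


# Synthetic stand-in for the features and the CATEs predicted by the forest:
# only the first feature drives the effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
cate_predictions = 2.0 * X[:, 0] + 0.1 * rng.normal(size=500)

surrogate = fit_surrogate_tree(X, cate_predictions)
print(surrogate.feature_importances_)

# With shap installed, the surrogate is an ordinary tree model:
#   explainer = shap.TreeExplainer(surrogate)
#   surrogate_shap = explainer(X)
```

Swapping DecisionTreeRegressor for sklearn.ensemble.RandomForestRegressor gives the forest variant mentioned above; either way the SHAP values describe the surrogate, not the causal forest itself.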

gcasamat commented 3 years ago

Thanks for your reply! To be sure I understand fully: what is the advantage of fitting a forest rather than a tree? I had in mind to build a fully grown tree and then apply SHAP to uncover the "important" variables. Should I care about overfitting in this context?