Open jbel1026 opened 3 years ago
Thanks @jbel1026 ! This looks like something we should definitely look into!
If you could send us any fake data that re-creates the problem, that would be fantastic; without a way to replicate the behavior we can't understand why it might be happening.
It would also help if you could paste the code snippets that generated these plots, especially the second one.
Best, V
Try econml 0.11 too just in case (we made some small changes in how we call shap), but I don't think that would fix the problem.
@vsyrgkanis I believe the update to 0.11 solved the issue! The shap_values now contain the outcome and treatment variable names. The summary plot still looks the same; I am digging deeper to see whether it is actually showing the appropriate values.
@vsyrgkanis ok, after looking into it more: the update fixed how the shap values are named, but the values themselves are still incorrect.
I am working on getting some dummy data, but in the meantime I can show you how I am fitting the model and computing the shap values:
est_xx = CausalForestDML(model_y=model_y_xx, model_t=model_t_xx, discrete_treatment=True, cv=5, n_estimators=500, max_features='sqrt', inference=True, random_state=123, verbose=0)
est_xx_fit = est_xx.dowhy.fit(Y=train[Y], T=train[T], X=train[X], W=train[W], cache_values=True)
shap_values = est_xx_fit.shap_values(modeling_df[X])
shap.summary_plot(shap_values[Y[0]][T[0] + '_1.0'])
That last shap summary plot outputs this:
Which I think is wrong because, when I look at DCSI Score vs. CATE, there is a very strong positive relationship that is not represented in the SHAP plot:
As a first step, try not using the dowhy wrapper, just in case the problem is there:
est_xx = CausalForestDML(model_y=model_y_xx, model_t=model_t_xx, discrete_treatment=True, cv=5, n_estimators=500, max_features='sqrt', inference=True, random_state=123, verbose=0)
est_xx_fit = est_xx.fit(Y=train[Y], T=train[T], X=train[X], W=train[W], cache_values=True)
shap_values = est_xx_fit.shap_values(modeling_df[X])
shap.summary_plot(shap_values[Y[0]][T[0] + '_1'])
As a second step, try using a larger min_samples_leaf, e.g. min_samples_leaf=20. You might be hitting some ill-posedness of the variance of the treatment within leaves.
As a third step, try calling tune on the forest before fitting. That might choose much better hyperparameters for your dataset:
est_xx = CausalForestDML(model_y=model_y_xx, model_t=model_t_xx, discrete_treatment=True, cv=5, n_estimators=500, max_features='sqrt', inference=True, random_state=123, verbose=0)
est_xx.tune(Y=train[Y], T=train[T], X=train[X], W=train[W])
est_xx_fit = est_xx.fit(Y=train[Y], T=train[T], X=train[X], W=train[W], cache_values=True)
shap_values = est_xx_fit.shap_values(modeling_df[X])
shap.summary_plot(shap_values[Y[0]][T[0] + '_1'])
@vsyrgkanis Thanks so much for these tips! After running the suggestions, the results were still the same.
Try calculating the shap values on a subsample of the train_df rather than a new sample. I'm not sure why that would make any difference, but try it out.
Another thing to check: can you run our notebooks where CausalForestDML + shap values are used, and see if you get the same results as the ones posted in the ran notebooks on the git repo?
Try also setting the parameters min_var_fraction_leaf=.1 and min_var_leaf_on_val=True.
Finally, try more trees together with the above options, i.e. n_estimators=4000, min_var_fraction_leaf=.1, min_var_leaf_on_val=True.
Trying these out will help us identify what the source of this behavior is.
@vsyrgkanis This looked like it worked! Here is the code I used; you will notice a couple of changes I made: I reduced the number of trees for computational reasons, and I tried with and without the dowhy wrapper and the results were the same.
Based on this, here is the most probable reason: shap values for causal forests are only approximate. The reason is that the prediction of a causal forest first calculates the conditional average jacobian of the moment by averaging across the trees, and then inverts it to produce the prediction.
Shap values, on the other hand, explain the prediction of each tree separately (for computational reasons) and then average the shap values across trees. But if the trees are deep, some leaves can end up with small variation in the treatment on the estimation half-sample of an honest tree. In that case the tree-based jacobian can be ill posed and the prediction of a single tree can be unstable. The shap values of some trees can then behave weirdly, and averaging the shap values across trees might not help much.
The option min_var_fraction_leaf ensures a minimum treatment variation in the leaves while splitting, and setting min_var_leaf_on_val to True ensures this minimum variation also holds on the estimation sample. This avoids the jacobian of a single tree becoming ill posed.
Tuning the forest could then also choose the appropriate depth for best prediction, and might have chosen a shallower tree than the default option, which also helps with the ill-posedness of the tree jacobians.
Both of these help to reduce the approximation error of the calculated shap values.
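The difference between the two aggregation orders can be seen in a toy numpy sketch (this is an illustration of the ill-posedness argument above, not econml internals): each "tree" contributes a moment average and a jacobian, and one tree with almost no treatment variation has a near-singular jacobian.

```python
import numpy as np

# Per-tree jacobians (here: within-leaf treatment variance) and moment averages.
# Well-behaved trees imply an effect of moment / jacobian = 2.0; the third tree
# has almost no treatment variation, so its jacobian is near zero.
jacobians = np.array([0.25, 0.22, 0.001])
moments = np.array([0.50, 0.44, 0.05])

# Forest prediction: average jacobians and moments across trees FIRST, then invert.
forest_estimate = moments.mean() / jacobians.mean()  # ~2.1, close to the truth

# Per-tree explanation: invert each tree's jacobian, THEN average.
# The near-singular third tree blows up and dominates the average.
per_tree_average = (moments / jacobians).mean()  # ~18, far from the truth

print(forest_estimate, per_tree_average)
```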
Most probably we need to add some disclaimers along these lines in the docs, and even warnings when someone computes shap values for a causal forest.
@vsyrgkanis Awesome, thanks so much for the explanation!
@vsyrgkanis On a side note, why is there no tune option for other forest-based estimators, like ForestDRLearner?
@jbel1026 great point!! There should be. One point that needed some care, though in the end it should be OK, is that the current tuning in CausalForestDML is done using the RScorer, which is also roughly what CausalForestDML is trying to optimize.
But there might be a small discrepancy between the RScorer and the DRLearner, which is trying to optimize a different loss; maybe we first need to implement a DRScorer (which uses the doubly robust loss to score an estimator) and use that in the tuning of the ForestDRLearner.
Alternatively, we could use the RScorer for all estimator tuning, since the RScorer is a universal quality score for all estimators.
But you are right, all forest estimators should be tunable.
@vsyrgkanis ok makes sense, thanks!
I have a related question. Suppose that, for the reasons exposed in this issue, I don't "trust" the results when SHAP is applied to CausalForestDML. Does it make sense to build a decision tree that explains the CATEs predicted by CausalForestDML and then apply SHAP to this decision tree?
See our CATE tree interpreters. They return exactly such a tree.
But yes, doing it outside of the package and interpreting with shap makes sense too.
You can then even fit a forest, and not just a tree, which is what our CATE interpreters do.
Thanks for your reply! To be sure I understand fully: what is the advantage of fitting a forest rather than a tree? I had in mind to build a fully grown tree and then apply SHAP to uncover the "important" variables. Should I care about overfitting in this context?
Hello,
I am using econml version 0.10.0 and shap v0.39.0. When I calculate SHAP values for the CausalForestDML estimator the results do not make sense, as opposed to when I calculate them the same way on other estimators like DRL/meta-learners.
Here is a brief description of my issue:
Y = ['event_flag']
T = ['Diabetes_A1c_Test_compliance_category_Non-Compliant']
So I am able to access my shap values using:
shap.plots.bar(shap_values[Y[0]][T[0]+'_1.0'])
for non-CausalForestDML models. However, for CausalForestDML the shap values take a different form: shap.plots.bar(shap_values['Y0']['T0_1.0']), even though everything else has been set up the same. Additionally, the results don't make sense:
For example, for DRL I see about what I expect (clear directional relationships between the impact and the value of the feature, for the most part):
But for DML (built the same way; the only difference is the estimator used), the results are not at all what I would expect (very little directional relationship):
Thanks!