I agree that get_node_samples() functionality would be useful, but I think we already have it in shadow.py:
@staticmethod
def node_samples(tree_model, data) -> Mapping[int, list]:
    """
    Return dictionary mapping node id to list of sample indexes considered by
    the feature/split decision.
    ...
I'd like to see more interpretation stuff that explains the path from the root down to the leaf. I.e., explain(test_vector)
would tell me lots about why it gets the prediction it gets. Perhaps we can add counterfactuals to indicate how to tweak a given record in order to make it select a particular class, such as "give loan" or "do not predict cancer".
Nice, good to know :). I will add a few examples of it in the notebooks (it will also help make it more visible).
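For example, a minimal sketch of such a notebook example, assuming ShadowDecTree is importable from dtreeviz.shadow and given a fitted sklearn tree clf, a feature DataFrame X, and a leaf id of interest (all names here are placeholders):

from dtreeviz.shadow import ShadowDecTree

# map each node id to the indexes of the training samples that reach it
node_to_samples = ShadowDecTree.node_samples(clf, X)

# inspect the training rows that end up in one particular leaf, e.g. node id 5
print(X.iloc[node_to_samples[5]].describe())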
I have two types of visualisations for prediction path interpretation, but they require a facelift :)) It would be nice to keep the same visualisation as dtreeviz(), but to show only the prediction path. I will take a deeper look at how dtreeviz() is implemented and maybe I can adapt it to the prediction path.
Well, it already does the prediction path; you can take a look at the examples. It just doesn't combine all of the decision nodes into some readable English text that explains why somebody did not get their loan. The decision trees we already have show those histograms, so I think everything you mention is already incorporated. :)
Yes, indeed. I was aware of it, but I thought you meant a visualisation only for the nodes on the prediction path...
"It just doesn't combine all of the decision nodes into some readable English text that explains why somebody did not get their loan." I will think about what we can do here. If you have any suggestions, let me know ;)
Oh right! Yes, a way to turn off non-prediction-path nodes would be a useful option for the visualizer. It should be a simple new boolean like "show_just_path" that works if they pass in a sample test record. It would turn off the orange highlighting and just show the nodes on the path.
It will help especially when the tree is very deep.
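If such an option were added, the call might look roughly like this (the show_just_path flag is the hypothetical option being proposed here; the other arguments mirror the usual dtreeviz() call, with placeholder names):

from dtreeviz.trees import dtreeviz

viz = dtreeviz(clf, X_train, y_train,
               target_name='target',
               feature_names=feature_names,
               class_names=class_names,
               X=x,                    # the single test record whose path we want to see
               show_just_path=True)    # hypothetical flag: hide all nodes off the prediction path
viz.view()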
We could summarize the "weight" of each feature used to get through the path. Currently I just highlight the feature in orange in the test vector. We could use "avg contribution"; see https://christophm.github.io/interpretable-ml-book/tree.html#interpretation-2
I guess this is a possible solution for "It just doesn't combine all of the decision nodes into some readable English text that explains why somebody did not get their loan.", right?
Thanks for the link. It seems like a book worth reading.
Yep. The link shows how to compute "gini drop" importance for a single instance. Also, I wonder if just combining all < and > decisions into a single range per feature could be useful. I.e., "your income was in this range" and "your education was in this range"...
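A rough sketch of that range idea, assuming a fitted sklearn DecisionTreeClassifier clf, a single observation x as a 1-D numpy array, and a feature_names list (none of this is existing dtreeviz API):

import numpy as np

def path_feature_ranges(clf, x, feature_names):
    # collapse all < / > decisions along x's prediction path into one (low, high) range per feature
    tree = clf.tree_
    path = clf.decision_path(x.reshape(1, -1)).indices   # node ids on the root-to-leaf path
    ranges = {}
    for node_id in path:
        feat = tree.feature[node_id]
        if feat < 0:                                      # leaf node, no split to record
            continue
        name = feature_names[feat]
        low, high = ranges.get(name, (-np.inf, np.inf))
        if x[feat] <= tree.threshold[node_id]:
            high = min(high, tree.threshold[node_id])     # went left: feature <= threshold
        else:
            low = max(low, tree.threshold[node_id])       # went right: feature > threshold
        ranges[name] = (low, high)
    return ranges                                         # e.g. {'income': (30000.0, inf), ...}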
Ah. I think this RuleFit algorithm is what I'm thinking of: https://christophm.github.io/interpretable-ml-book/rulefit.html https://arxiv.org/abs/0811.1679
Ah, see this stuff: http://blog.datadive.net/interpreting-random-forests/
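The per-instance contribution idea from that post can be sketched roughly like this: walk the prediction path and credit the change in predicted class probability at each split to the feature that was tested (again assuming a fitted sklearn DecisionTreeClassifier clf, a 1-D numpy sample x, and a feature_names list; class_idx picks the class of interest):

import numpy as np

def path_contributions(clf, x, feature_names, class_idx=1):
    # treeinterpreter-style decomposition: prediction = prior at the root + sum of per-feature deltas
    tree = clf.tree_
    path = np.sort(clf.decision_path(x.reshape(1, -1)).indices)  # node ids increase from root to leaf
    def prob(node_id):
        counts = tree.value[node_id][0]
        return counts[class_idx] / counts.sum()
    contributions = {}
    for parent, child in zip(path[:-1], path[1:]):
        name = feature_names[tree.feature[parent]]
        contributions[name] = contributions.get(name, 0.0) + prob(child) - prob(parent)
    return prob(path[0]), contributions    # (prior probability, per-feature probability changes)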
Also see https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html which shows how to do English descriptions like:
Rules used to predict sample 0:
decision id node 0 : (X_test[0, 3] (= 2.4) > 0.800000011920929)
decision id node 2 : (X_test[0, 2] (= 5.1) > 4.950000047683716)
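The core of that scikit-learn example is just walking the decision path and printing each node's test, roughly like this (assuming a fitted clf and a numpy feature matrix X_test, following the linked example):

node_indicator = clf.decision_path(X_test)
leaf_id = clf.apply(X_test)
feature = clf.tree_.feature
threshold = clf.tree_.threshold

sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]
print(f"Rules used to predict sample {sample_id}:")
for node_id in node_index:
    if leaf_id[sample_id] == node_id:          # the leaf itself has no test
        continue
    sign = "<=" if X_test[sample_id, feature[node_id]] <= threshold[node_id] else ">"
    print(f"decision id node {node_id} : (X_test[{sample_id}, {feature[node_id]}] "
          f"(= {X_test[sample_id, feature[node_id]]}) {sign} {threshold[node_id]})")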
Hi @parrt. I added a few leaf sample investigations to the tree_structure_example notebook (also created PR #76).
It looks something like this:
Right now, to get the leaf samples, we need the following code:
node_samples = ShadowDecTree.node_samples(dtc, dataset[features])
dataset[features + [target]].iloc[node_samples[58]].describe()
It seems a little too much... I would like to wrap up all the details into a method, something like this:
describe_node_sample(node_id=58, data=dataset[features])
What is your opinion? :)
Looks great! Hm... yeah, but don't we need the model in there?
describe(mytreemodel, node_id, dataset[features])
Yes, we also need the model as a parameter :)
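A minimal sketch of what that wrapper could look like, built on top of the existing ShadowDecTree.node_samples (describe_node_sample and its signature are just the proposal being discussed here, not an existing dtreeviz API):

from dtreeviz.shadow import ShadowDecTree

def describe_node_sample(tree_model, node_id, data):
    # summary statistics for the training samples that reach the given node/leaf
    node_to_samples = ShadowDecTree.node_samples(tree_model, data)
    return data.iloc[node_to_samples[node_id]].describe()

# usage with the placeholder names from above:
# describe_node_sample(dtc, node_id=58, data=dataset[features])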
Generally speaking, trained ML models can be used both to make predictions and to better understand our data. Until now, we have created visualisations for the histogram of classes in leaves and split plots for the regressor leaves, but we don't have a way to find out more about the training samples reaching those leaves. I think it would be useful to have a method that returns the leaf training examples and maybe some general stats about them.
Below, I have attached a screenshot with get_node_samples() from my library. I guess it could be integrated into dtreeviz pretty easily, taking into consideration that there is already built-in functionality to take the samples from a leaf.