parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License

Discover data patterns by investigating leaf training samples #65

Closed: tlapusan closed this issue 4 years ago

tlapusan commented 4 years ago

Generally speaking, trained ML models can be used both to make predictions and to better understand our data. Until now, we have created visualisations for histograms of classes in leaves and split plots for regressor leaves, but we don't have a way to find out more about the training samples reaching those leaves. I think it would be useful to have a method that returns the leaf training samples and maybe some general stats about them.

Below, I have attached a screenshot of get_node_samples() from my library. I guess it could be integrated into dtreeviz fairly easily, considering that there is already built-in functionality to take the samples from a leaf.

[Screenshot: get_node_samples() output from tlapusan's library]
parrt commented 4 years ago

I agree that get_node_samples() functionality would be useful, but I think we already have this in shadow.py:

    @staticmethod
    def node_samples(tree_model, data) -> Mapping[int, list]:
        """
        Return dictionary mapping node id to list of sample indexes considered by
        the feature/split decision.
...
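
For context, a node-to-samples mapping like this can be computed with scikit-learn's public API. A minimal sketch (an illustration only, not necessarily how shadow.py implements it):

    from collections import defaultdict

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    dtc = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

    # decision_path() returns a sparse (n_samples, n_nodes) indicator matrix;
    # entry (i, j) is 1 when sample i passes through node j.
    paths = dtc.decision_path(iris.data)

    node_to_samples = defaultdict(list)
    for sample_i, node_j in zip(*paths.nonzero()):
        node_to_samples[node_j].append(sample_i)

    # node_to_samples[0] now lists every sample index (all pass the root)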
parrt commented 4 years ago

I'd like to see more interpretation stuff that explains the path from the root down to the leaf. I.e., explain(test_vector) would tell me lots about why it gets the prediction it gets. Perhaps we can add counterfactuals to indicate how to tweak a given record in order to make it select a particular class, such as "give loan" or "do not predict cancer".
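
The counterfactual part could start as simply as walking the decision path and testing single-feature tweaks. A naive sketch (simple_counterfactual is a hypothetical helper, not dtreeviz API; it assumes a fitted sklearn tree and a 1-D numpy sample x):

    import numpy as np

    def simple_counterfactual(tree, x, desired_class):
        """Naive sketch: at each split on x's path, nudge the tested feature
        just across the threshold and keep the first change that yields
        desired_class. Real counterfactual methods optimize more carefully."""
        t = tree.tree_
        node = 0
        while t.children_left[node] != -1:       # -1 marks a leaf in sklearn
            feat, thr = t.feature[node], t.threshold[node]
            x_try = x.copy()
            # step just across this node's split threshold
            x_try[feat] = thr + 1e-6 if x[feat] <= thr else thr - 1e-6
            if tree.predict(x_try.reshape(1, -1))[0] == desired_class:
                return feat, x_try[feat]         # (feature index, new value)
            node = t.children_left[node] if x[feat] <= thr else t.children_right[node]
        return None                              # no single-feature tweak found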

tlapusan commented 4 years ago

> I agree that get_node_samples() functionality would be useful, but I think we already have this in shadow.py:
>
>     @staticmethod
>     def node_samples(tree_model, data) -> Mapping[int, list]:
>         """
>         Return dictionary mapping node id to list of sample indexes considered by
>         the feature/split decision.
>     ...

Nice, good to know :). I will add a few examples of it in notebooks (this will also help make it more visible).

tlapusan commented 4 years ago

> I'd like to see more interpretation stuff that explains the path from the root down to the leaf. I.e., explain(test_vector) would tell me lots about why it gets the prediction it gets. Perhaps we can add counterfactuals to indicate how to tweak a given record in order to make it select a particular class, such as "give loan" or "do not predict cancer".

I have two types of visualisations for prediction path interpretation, but they require a facelift :)) It would be nice to keep the same visualisation as dtreeviz(), but show only the prediction path. I will take a deeper look at how dtreeviz() is implemented and maybe I can adapt it to the prediction path.

[Images: decision_tree_prediction_path, decision_tree_splits_prediction_part_1]
parrt commented 4 years ago

Well, it already does the prediction path; you can take a look at the examples. It just doesn't combine all of the decision nodes into some readable English text that explains why somebody did not get their loan. The decision trees we have already show those histograms, so I think everything you mention is already incorporated. :)

tlapusan commented 4 years ago

> Well, it already does the prediction path; you can take a look at the examples. It just doesn't combine all of the decision nodes into some readable English text that explains why somebody did not get their loan. The decision trees we have already show those histograms, so I think everything you mention is already incorporated. :)

Yes, indeed. I was aware of it, but I thought that you meant a visualisation showing only the nodes from the prediction path...

[Screenshot: full tree visualisation with the prediction path highlighted]

> It just doesn't combine all of the decision nodes into some readable English text that explains why somebody did not get their loan.

I will think about what we can do here. If you have any suggestions, let me know ;)

parrt commented 4 years ago

Oh right! Yes, a way to turn off non-prediction-path nodes would be a useful option for the visualizer. It should be a simple new boolean like "show_just_path" that works when they pass in a sample test record. It would turn off the orange highlighting and just show the nodes on the path.
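
From the caller's side that might look like this (a sketch: show_just_path is the flag being proposed here, not an existing dtreeviz argument at this point, and the other names are placeholder variables):

    from dtreeviz.trees import dtreeviz

    viz = dtreeviz(dtc, X_train, y_train,
                   target_name="target",
                   feature_names=features,
                   class_names=list(class_names),
                   X=test_sample,           # the record whose path we explain
                   show_just_path=True)     # proposed: hide all off-path nodes
    viz.view()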

tlapusan commented 4 years ago

It will help especially when the tree is very deep.

parrt commented 4 years ago

We could summarize the "weight" of each feature used to get through the path. Currently I just highlight the feature in orange in the test vector. We could use "avg contribution"; see https://christophm.github.io/interpretable-ml-book/tree.html#interpretation-2

tlapusan commented 4 years ago

> We could summarize the "weight" of each feature used to get through the path. Currently I just highlight the feature in orange in the test vector. We could use "avg contribution"; see https://christophm.github.io/interpretable-ml-book/tree.html#interpretation-2

I guess this is a possible solution for "It just doesn't combine all of the decision nodes into some readable English text that explains why somebody did not get their loan." Right?

Thanks for the link. It seems like a book worth reading.

parrt commented 4 years ago

Yep. The link shows how to compute "gini drop" importance for a single instance. Also, I wonder if just combining all < and > decisions into a single range per feature could be useful. I.e., "your income was in this range and your education was in this range"...
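
Collapsing a path into per-feature ranges could look roughly like this (path_feature_ranges is a hypothetical helper, assuming a fitted sklearn tree and a 1-D numpy sample x):

    import numpy as np

    def path_feature_ranges(tree, x):
        """Collapse all <=/> decisions on x's path into one (low, high)
        interval per feature index. Sketch only."""
        t = tree.tree_
        ranges = {}                              # feature index -> (low, high)
        node = 0
        while t.children_left[node] != -1:       # -1 marks a leaf in sklearn
            feat, thr = t.feature[node], t.threshold[node]
            low, high = ranges.get(feat, (-np.inf, np.inf))
            if x[feat] <= thr:                   # went left: tighten upper bound
                ranges[feat] = (low, min(high, thr))
                node = t.children_left[node]
            else:                                # went right: tighten lower bound
                ranges[feat] = (max(low, thr), high)
                node = t.children_right[node]
        return ranges

    # e.g. {2: (2.45, 4.95), 3: (0.8, inf)} reads as
    # "feature 2 was in (2.45, 4.95] and feature 3 was above 0.8"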

parrt commented 4 years ago

Ah, I think this RuleFit algorithm is what I'm thinking of: https://christophm.github.io/interpretable-ml-book/rulefit.html https://arxiv.org/abs/0811.1679

parrt commented 4 years ago

Ah, see this stuff: http://blog.datadive.net/interpreting-random-forests/

[Screenshot: feature contribution example from the linked blog post]
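
The linked post is implemented in the third-party treeinterpreter package; basic usage looks roughly like this (a sketch, assuming a fitted sklearn model dtc, test data X_test, and a matching feature-name list features):

    from treeinterpreter import treeinterpreter as ti

    # Decomposes each prediction into bias (the root node value) plus one
    # additive contribution per feature along the sample's decision path.
    prediction, bias, contributions = ti.predict(dtc, X_test)

    # For a classifier, contributions[0] has shape (n_features, n_classes).
    for name, contrib in zip(features, contributions[0]):
        print(name, contrib)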
parrt commented 4 years ago

Also see https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html which shows how to do English descriptions like:

    Rules used to predict sample 0:
    decision id node 0 : (X_test[0, 3] (= 2.4) > 0.800000011920929)
    decision id node 2 : (X_test[0, 2] (= 5.1) > 4.950000047683716)
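
A condensed sketch of what that sklearn example does, using only the public tree_ attributes (print_path_rules is an illustrative helper, not dtreeviz API):

    def print_path_rules(tree, X_test, sample_id):
        """Print the decision rules applied to one sample, in the style of
        sklearn's 'unveil tree structure' example."""
        t = tree.tree_
        node_indicator = tree.decision_path(X_test)
        leaf_id = tree.apply(X_test)[sample_id]
        # node ids on this sample's root-to-leaf path
        path = node_indicator.indices[node_indicator.indptr[sample_id]:
                                      node_indicator.indptr[sample_id + 1]]
        print(f"Rules used to predict sample {sample_id}:")
        for node in path:
            if node == leaf_id:                  # the leaf holds no decision
                continue
            feat, thr = t.feature[node], t.threshold[node]
            op = "<=" if X_test[sample_id, feat] <= thr else ">"
            print(f"decision id node {node} : (X_test[{sample_id}, {feat}] "
                  f"(= {X_test[sample_id, feat]}) {op} {thr})")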
tlapusan commented 4 years ago

Hi @parrt. I added a few leaf sample investigations to the tree_structure_example notebook (and also created PR #76).

It looks something like this:

[Screenshot: describe() summary of a leaf's training samples]

Right now, to get the leaf samples, we need the following code:

    node_samples = ShadowDecTree.node_samples(dtc, dataset[features])
    dataset[features + [target]].iloc[node_samples[58]].describe()

It seems a little too much... I would like to wrap up all the details into a method, something like describe_node_sample(node_id=58, dataset[features]). What is your opinion? :)

parrt commented 4 years ago

Looks great! Hm... yeah, but don't we need the model in there? describe(mytreemodel, node_id, dataset[features])

tlapusan commented 4 years ago

Yes, we also need the model as a parameter :)
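
Putting the thread together, the agreed-upon helper could look something like this (a sketch with a hypothetical name and signature; it assumes the ShadowDecTree.node_samples static method from this repo's shadow.py):

    import pandas as pd
    from dtreeviz.shadow import ShadowDecTree

    def describe_node_sample(tree_model, node_id, X: pd.DataFrame, y=None):
        """Hypothetical wrapper: pandas describe() over the training samples
        that reach node_id. Not part of dtreeviz at the time of this thread."""
        node_samples = ShadowDecTree.node_samples(tree_model, X)
        subset = X.iloc[node_samples[node_id]]
        if y is not None:
            subset = subset.assign(target=y.iloc[node_samples[node_id]].values)
        return subset.describe()

    # e.g. describe_node_sample(dtc, 58, dataset[features], dataset[target])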