parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License

Enable `ctree_feature_space()` to run on trees with more than 1 or 2 features. #242

Closed (mepland closed this issue 1 year ago)

mepland commented 1 year ago

Currently ctree_feature_space() only works for trees with 1 or 2 input features:

# TODO: check if we can find some common functionality between univar and bivar visualisations and refactor
#  to a single method.
if len(self.shadow_tree.feature_names) == 1:     # univar example
    _ctreeviz_univar(self.shadow_tree, fontsize, ticks_fontsize, fontname, nbins, gtype, show, colors, figsize, ax)
elif len(self.shadow_tree.feature_names) == 2:   # bivar example
    _ctreeviz_bivar(self.shadow_tree, fontsize, ticks_fontsize, fontname, show, colors, figsize, ax)
else:
    raise ValueError(f"ctree_feature_space supports a dataset with only one or two features."
                     f" You provided a dataset with {len(self.shadow_tree.feature_names)} features {self.shadow_tree.feature_names}.")

We should allow ctree_feature_space() to run on trees with any number of features, as long as the 1 or 2 features to be plotted are specified via a parameter.

It is useful to visualize the feature space, with all splits on that feature, for any tree, not just toy trees trained specifically on the feature(s) of interest. We should be able to run this on any tree if we specify which feature_name to extract.
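To make the proposal concrete, here is a minimal sketch of how the feature selection could work. It is not the dtreeviz implementation: the `features` parameter and the helper name `resolve_plot_features` are hypothetical, and the helper only decides which plot to draw and for which columns, without drawing anything.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_plot_features(
    feature_names: Sequence[str],
    features: Optional[Sequence[str]] = None,
) -> Tuple[str, List[int]]:
    """Decide which plot to draw and which column indices it covers.

    `features` is a hypothetical parameter naming the 1 or 2 features to plot.
    When it is omitted, fall back to the current behaviour, where the tree
    itself must have been trained on exactly 1 or 2 features.
    """
    names = list(feature_names)
    requested = names if features is None else list(features)

    if len(requested) not in (1, 2):
        raise ValueError(
            "ctree_feature_space can plot only one or two features; "
            f"got {len(requested)}: {requested}."
        )

    unknown = [f for f in requested if f not in names]
    if unknown:
        raise ValueError(f"Unknown feature(s) {unknown}; tree has {names}.")

    # Map each requested name to its column position in the training data.
    indices = [names.index(f) for f in requested]
    kind = "univar" if len(indices) == 1 else "bivar"
    return kind, indices
```

With this in place, a tree trained on `["a", "b", "c"]` could be asked for `["b"]` (univariate plot of column 1) or `["c", "a"]` (bivariate plot of columns 2 and 0), while a call without `features` keeps today's error for trees with more than two features.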

We may want to also address the refactoring comment at the same time.

mepland commented 1 year ago

@parrt @tlapusan I can rework my initial solution from #200, including a refactoring, but I wanted to get your thoughts first.

tlapusan commented 1 year ago

@mepland the suggestion sounds good. Keep in mind that if the tree is trained on multiple features, then the feature space/predictions for the 1 or 2 features we want to display will also be influenced by the contributions of the other features.

mepland commented 1 year ago

Yes, the splits displayed will be potentially conditional on other variables, but I find it can still be illuminating to see where the splitting is happening on a feature across all branches.
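As a rough illustration of "where the splitting is happening on a feature across all branches", here is a small sketch that gathers every threshold used on one feature. It assumes the tree is given as two parallel per-node arrays in the layout scikit-learn exposes as `clf.tree_.feature` and `clf.tree_.threshold`, where leaf nodes carry a negative feature index; the function name `split_thresholds` and the example arrays are made up for illustration.

```python
from typing import List, Sequence


def split_thresholds(
    node_features: Sequence[int],
    node_thresholds: Sequence[float],
    feature_index: int,
) -> List[float]:
    """Return the sorted, de-duplicated thresholds of all splits on one feature.

    Every node in the tree is scanned regardless of which branch it sits on,
    so the result mixes splits that are conditional on other features.
    """
    if feature_index < 0:
        # Negative indices mark leaves in the assumed layout, not real features.
        raise ValueError("feature_index must be non-negative")
    return sorted({
        float(threshold)
        for feature, threshold in zip(node_features, node_thresholds)
        if feature == feature_index
    })


# Hand-written arrays for a 7-node tree: feature 0 splits at the root and again
# in the right subtree, feature 2 splits in the left subtree, -2 marks leaves.
node_features = [0, 2, -2, -2, 0, -2, -2]
node_thresholds = [5.0, 1.5, -2.0, -2.0, 7.5, -2.0, -2.0]

print(split_thresholds(node_features, node_thresholds, 0))  # [5.0, 7.5]
```

The split at 7.5 only applies to samples that already went right at the root, which is exactly the conditioning caveat above, but drawing both lines on the feature axis still shows where the tree cuts that feature.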

@parrt what do you think?

parrt commented 1 year ago

Yes, I think this does make sense. In other words, right now I force both the model and the plot to use one or two variables, and you are proposing to take any model and simply display one or two of its variables. This is kind of like partial dependence plots, which ignore the effect of other variables on the model.

Yes I'm OK with that!

parrt commented 1 year ago

Decided to bang this out myself so I understand it again...in progress https://github.com/parrt/dtreeviz/pull/253

mepland commented 1 year ago

Sounds good @parrt, LMK if / when you'd like me to test it!

parrt commented 1 year ago

getting closer...working on univar

mepland commented 1 year ago

Implemented in https://github.com/parrt/dtreeviz/pull/253