parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License
2.89k stars, 333 forks

Handle numerous leaves better in leaf stats plots - no overlaps #247

Open mepland opened 1 year ago

mepland commented 1 year ago

When a tree has many leaves, the x-axis tick labels of the leaf stats plots can begin to overlap. One solution, illustrated below with ctree_leaf_distributions(), is to make the plot larger with the figsize parameter. However, we should probably build another solution that selectively labels only some leaves, or otherwise turns off the individual labels on all leaves.

See additional discussion on https://github.com/parrt/dtreeviz/pull/220
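
As a rough illustration of the "selectively label some leaves" idea, here is a minimal matplotlib-only sketch. It is not dtreeviz code: `select_tick_positions` and `plot_leaf_counts` are hypothetical helper names, and the sketch assumes the leaves are drawn as one bar per leaf at positions 0..n-1.

```python
def select_tick_positions(n_leaves, max_labels=20):
    """Return the 0-based bar positions that should keep a tick label.

    Hypothetical helper: keeps every k-th label so that at most
    `max_labels` labels are drawn. `max_labels=0` turns all labels off.
    """
    if n_leaves <= 0 or max_labels < 1:
        return []
    # Ceiling division: smallest step that keeps at most max_labels labels.
    step = -(-n_leaves // max_labels)
    return list(range(0, n_leaves, step))


def plot_leaf_counts(node_ids, counts, max_labels=20, figsize=(10, 4)):
    """Bar plot of per-leaf sample counts with thinned x-axis labels."""
    # Imported here so the helper above stays usable without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=figsize)
    ax.bar(list(range(len(node_ids))), counts)

    # Label only the selected bars; the others keep their bar but no text.
    keep = select_tick_positions(len(node_ids), max_labels)
    ax.set_xticks(keep)
    ax.set_xticklabels([str(node_ids[i]) for i in keep])
    ax.set_xlabel("leaf node id")
    ax.set_ylabel("samples")
    return fig, ax
```

Calling `plot_leaf_counts(ids, counts, max_labels=0)` would drop the individual labels entirely, which covers the second option mentioned above.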

[Image: Screenshot 2023-01-09 at 08 25 00]

tlapusan commented 1 year ago

Another solution would be horizontal scrolling, as explained here: https://www.geeksforgeeks.org/python-scroll-through-plots/

I did some quick copy/paste experiments with the scrolling code examples, but they didn't work for me. Further investigation is needed.
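
For reference, one way horizontal scrolling could be sketched with matplotlib's `Slider` widget is shown below. This is an untested-in-dtreeviz assumption, not the code from the linked article: `window_limits` and `scrollable_bar_plot` are hypothetical names, and the approach only works in an interactive backend.

```python
def window_limits(start, width, n_bars):
    """Clamp a scrolling window of `width` bars to the range of bars.

    Returns the (left, right) x-limits for bars centered on integers.
    """
    width = min(width, n_bars)
    start = max(0, min(start, n_bars - width))
    # Bars are centered on integers, so pad by half a bar on each side.
    return start - 0.5, start + width - 0.5


def scrollable_bar_plot(node_ids, counts, width=25):
    """Show `width` bars at a time and scroll with a slider."""
    # Imported here so window_limits stays usable without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib.widgets import Slider

    n = len(node_ids)
    fig, ax = plt.subplots(figsize=(10, 4))
    fig.subplots_adjust(bottom=0.25)  # leave room for the slider
    ax.bar(list(range(n)), counts)
    ax.set_xticks(list(range(n)))
    ax.set_xticklabels([str(i) for i in node_ids], rotation=90)
    ax.set_xlim(*window_limits(0, width, n))

    slider_ax = fig.add_axes([0.15, 0.08, 0.7, 0.04])
    slider = Slider(slider_ax, "first bar", 0, max(n - width, 1),
                    valinit=0, valstep=1)

    def on_scroll(value):
        # Move the visible window; the bars themselves are not redrawn.
        ax.set_xlim(*window_limits(int(value), width, n))
        fig.canvas.draw_idle()

    slider.on_changed(on_scroll)
    # The caller must keep a reference to the slider, or it stops responding.
    return fig, slider
```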

mepland commented 1 year ago

> Another solution would be to have a horizontal scrolling

I think we should still have a solution that can be printed to a flat, non-interactive file as well.

parrt commented 1 year ago

Yes, we definitely want to be able to print something. If I understand the use case, we'd like to learn which leaves are the most important and to get information on the samples in those leaves. Why not simply sort the most important leaves to the left, so that we are guaranteed to at least see them? Nobody cares about the node id value per se; you only need the node id to ask questions about it later.

So, in other words, for the graph above, that giant spike at node 99 should be shifted all the way to the left.

Another possibility is to remove the tick labels from the x-axis and place them above the bars themselves. The text could probably be rotated left 90° so it reads bottom to top rather than left to right. That should free up a lot more room.
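
Both suggestions (largest leaves first, node ids written above the bars) can be sketched together in plain matplotlib. The sketch assumes that "most important" means "most samples", which is an interpretation rather than something settled in this thread; `sort_leaves_by_size` and `plot_sorted_leaves` are hypothetical names, and `Axes.bar_label` requires matplotlib 3.4 or newer.

```python
def sort_leaves_by_size(node_ids, counts):
    """Reorder leaves so the largest sample count comes first.

    Ties keep their original relative order (Python's sort is stable).
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return [node_ids[i] for i in order], [counts[i] for i in order]


def plot_sorted_leaves(node_ids, counts, figsize=(10, 4)):
    """Bars sorted by size, with each node id drawn above its bar."""
    # Imported here so the sorting helper stays usable without matplotlib.
    import matplotlib.pyplot as plt

    ids, sizes = sort_leaves_by_size(node_ids, counts)
    fig, ax = plt.subplots(figsize=figsize)
    bars = ax.bar(list(range(len(ids))), sizes)

    # Node ids go above the bars, rotated 90 degrees so they read bottom to top.
    ax.bar_label(bars, labels=[str(i) for i in ids],
                 rotation=90, padding=3, fontsize=8)

    # The x-axis no longer carries node ids, so drop its ticks entirely.
    ax.set_xticks([])
    ax.set_xlabel("leaves, largest first")
    ax.set_ylabel("samples")
    ax.margins(y=0.15)  # headroom so the rotated labels are not clipped
    return fig, ax
```

Because the result is an ordinary figure, `fig.savefig(...)` still produces a flat, printable file.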

mepland commented 1 year ago

Let me try prototyping a solution that lets matplotlib draw the axis ticks where it thinks they should be. It would be a maintainable solution since we're relying on matplotlib; it just may have gaps along the axis if the node ids are not sequential.
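
A minimal sketch of that idea, assuming the bars are placed at their actual node id values and matplotlib's `MaxNLocator` picks the tick positions. This is not the code from the eventual PR; `missing_node_ids` and `plot_with_auto_ticks` are hypothetical names used only to show where the gaps come from.

```python
def missing_node_ids(node_ids):
    """Ids between the smallest and largest leaf id that are not leaves.

    When bars are placed at their node id, these ids appear as gaps.
    """
    if not node_ids:
        return []
    present = set(node_ids)
    return [i for i in range(min(node_ids), max(node_ids) + 1)
            if i not in present]


def plot_with_auto_ticks(node_ids, counts, max_ticks=10, figsize=(10, 4)):
    """Place bars at their node id and let matplotlib choose the ticks."""
    # Imported here so the helper above stays usable without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib.ticker import MaxNLocator

    fig, ax = plt.subplots(figsize=figsize)
    # x positions are the node ids themselves, so non-sequential ids leave gaps.
    ax.bar(node_ids, counts)

    # At most `max_ticks` intervals, and only at integer positions.
    ax.xaxis.set_major_locator(MaxNLocator(nbins=max_ticks, integer=True))
    ax.set_xlabel("leaf node id")
    ax.set_ylabel("samples")
    return fig, ax
```

The trade-off is that a tick may land on an id that is not a leaf, so the axis reads as a numeric scale rather than as one label per bar.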

mepland commented 1 year ago

@parrt @tlapusan see here: https://github.com/parrt/dtreeviz/pull/254

mepland commented 1 year ago

Fixed for ctree_leaf_distributions in https://github.com/parrt/dtreeviz/pull/254

@parrt should I now do the same for rtree_leaf_distributions, leaf_sizes, and leaf_purity? I should probably refactor the code to be reusable, similar to _format_axes. Are there any other plots I am missing?

parrt commented 1 year ago

basically, you'd update all leaf plots? Makes sense to me. Probably good to refactor, yep!