guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0
459 stars, 100 forks

Add multiple graph option #315

Open lcrmorin opened 6 months ago

lcrmorin commented 6 months ago

I usually use individual optbinning plots for data exploration, generated through a for loop. I was wondering if there could be a default multi-plot for optbinning. This would be my default function for data exploration. Currently, the best option I have is the pandas hist (see below).

[Image: Screenshot 2024-05-01 at 08 19 05]

It would be very nice to have such a plot with binning and target dependency for data exploration.

bmreiniger commented 6 months ago

What do your loops look like? Are you using a BinningProcess, getting each underlying variable, and using the associated table's plot method? In which case, we could just add a plot method to BinningProcess that does that? Should OptimalBinning and friends directly expose a plot as an intermediate?

guillermo-navas-palencia commented 6 months ago

Hi @lcrmorin. Indeed, it would be a nice addition. Would you be willing to work on this feature?

lcrmorin commented 6 months ago

I was trying to do it myself. Ultimately, my problem relates more to positioning the plots on the grid than to optbinning itself.

bmreiniger commented 6 months ago

I think adding an ax parameter to the tables' plot method (accepting an existing matplotlib Axes object, or None to create a new one; IIRC this is how pandas and sklearn both implement many of their plotting utilities) would be an improvement in general, and would also make this easier. BinningProcess would then just have to create the subplots (e.g. with plt.subplots) and iterate over zip(axes, _binning_variables). If I have some time I'll give a PR a go, but I'm happy to let someone else try instead.
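To make the ax idea concrete, here is a minimal sketch using only matplotlib. It is not optbinning code: plot_event_rate is a hypothetical stand-in for a binning table's plot method, plot_grid is a hypothetical stand-in for what a BinningProcess plot method might do, and the bin labels and event rates are made-up illustration data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def plot_event_rate(bin_labels, event_rates, ax=None, title=None):
    """Hypothetical stand-in for a binning table's plot method.

    Draws on the given Axes, or creates a new figure when ax is None.
    """
    if ax is None:
        _, ax = plt.subplots()
    positions = range(len(bin_labels))
    ax.bar(positions, event_rates)
    ax.set_xticks(positions)
    ax.set_xticklabels(bin_labels, rotation=45, ha="right")
    ax.set_ylabel("Event rate")
    if title is not None:
        ax.set_title(title)
    return ax


def plot_grid(tables, ncols=3):
    """Hypothetical multi-plot: one subplot per variable, laid out on a grid."""
    n = len(tables)
    nrows = -(-n // ncols)  # ceiling division
    fig, axes = plt.subplots(
        nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False
    )
    flat = axes.ravel()
    for ax, (name, (labels, rates)) in zip(flat, tables.items()):
        plot_event_rate(labels, rates, ax=ax, title=name)
    for ax in flat[n:]:
        ax.set_visible(False)  # hide unused cells in the last row
    fig.tight_layout()
    return fig, axes


# Made-up data: variable name -> (bin labels, event rate per bin)
tables = {
    "age": (["<30", "30-50", ">=50"], [0.12, 0.08, 0.05]),
    "income": (["low", "mid", "high"], [0.15, 0.09, 0.04]),
    "tenure": (["<1y", "1-5y", ">=5y"], [0.20, 0.10, 0.06]),
    "balance": (["<1k", ">=1k"], [0.11, 0.07]),
}
fig, axes = plot_grid(tables, ncols=3)
```

Passing squeeze=False keeps axes two-dimensional even for a single row, so the same flattening logic works for any number of variables. Calling plot_event_rate without ax still produces a standalone figure, so existing single-plot usage would not change.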

bmreiniger commented 6 months ago

I've started looking more seriously at the code, and I don't really use the 2d or pw or streaming binners; should they all support plotting?

In a BinningProcess, should the relevant statistics tables have build run first (with default parameters, or should we try to pass keywords through based on the type of column)?