guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

Random binning outputs generated #310

Closed lx0531 closed 3 months ago

lx0531 commented 3 months ago

Hi, I encountered something similar to issue #299. More specifically, running ContinuousOptimalBinning twice with the same settings can produce a different number of prebins and a different monotonic trend, which in turn leads to a different binning output. So far I have only observed this when using "quantile" as the prebinning_method. Here is an example:

from optbinning import ContinuousOptimalBinning

# df2 is my own DataFrame (not attached to this issue)
tmp_var = "bathrooms_0_newkey"
x = df2[tmp_var].values
y = df2["incurred_loss_ratio"]
# random.seed(33)
optb = ContinuousOptimalBinning(
    name=tmp_var,
    dtype="numerical",
    prebinning_method="quantile",
    min_bin_size=0.001,
    monotonic_trend="auto",
    verbose=True,
)

optb.fit(x, y)
print(optb.splits)

The output when running it the first time is:

2024-04-03 19:44:50,140 | INFO : Optimal binning started.
2024-04-03 19:44:50,141 | INFO : Options: check parameters.
2024-04-03 19:44:50,142 | INFO : Pre-processing started.
2024-04-03 19:44:50,142 | INFO : Pre-processing: number of samples: 350938
2024-04-03 19:44:50,162 | INFO : Pre-processing: number of clean samples: 350938
2024-04-03 19:44:50,163 | INFO : Pre-processing: number of missing samples: 0
2024-04-03 19:44:50,163 | INFO : Pre-processing: number of special samples: 0
2024-04-03 19:44:50,163 | INFO : Pre-processing terminated. Time: 0.0189s
2024-04-03 19:44:50,164 | INFO : Pre-binning started.
2024-04-03 19:44:50,256 | INFO : Pre-binning: number of prebins: 5
2024-04-03 19:44:50,257 | INFO : Pre-binning terminated. Time: 0.0917s
2024-04-03 19:44:50,257 | INFO : Optimizer started.
2024-04-03 19:44:50,259 | INFO : Optimizer: classifier predicts ascending monotonic trend.
2024-04-03 19:44:50,259 | INFO : Optimizer: monotonic trend set to ascending.
2024-04-03 19:44:50,260 | INFO : Optimizer: build model...
2024-04-03 19:44:50,263 | INFO : Optimizer: solve...
2024-04-03 19:44:50,269 | INFO : Optimizer terminated. Time: 0.0114s
2024-04-03 19:44:50,270 | INFO : Post-processing started.
2024-04-03 19:44:50,270 | INFO : Post-processing: compute binning information.
2024-04-03 19:44:50,280 | INFO : Post-processing terminated. Time: 0.0089s
2024-04-03 19:44:50,280 | INFO : Optimal binning terminated. Status: OPTIMAL. Time: 0.1399s
[]

and the output of the second run is:

2024-04-03 19:44:54,779 | INFO : Optimal binning started.
2024-04-03 19:44:54,780 | INFO : Options: check parameters.
2024-04-03 19:44:54,781 | INFO : Pre-processing started.
2024-04-03 19:44:54,781 | INFO : Pre-processing: number of samples: 350938
2024-04-03 19:44:54,800 | INFO : Pre-processing: number of clean samples: 350938
2024-04-03 19:44:54,801 | INFO : Pre-processing: number of missing samples: 0
2024-04-03 19:44:54,801 | INFO : Pre-processing: number of special samples: 0
2024-04-03 19:44:54,802 | INFO : Pre-processing terminated. Time: 0.0184s
2024-04-03 19:44:54,802 | INFO : Pre-binning started.
2024-04-03 19:44:54,894 | INFO : Pre-binning: number of prebins: 4
2024-04-03 19:44:54,895 | INFO : Pre-binning terminated. Time: 0.0916s
2024-04-03 19:44:54,895 | INFO : Optimizer started.
2024-04-03 19:44:54,897 | INFO : Optimizer: classifier predicts descending monotonic trend.
2024-04-03 19:44:54,898 | INFO : Optimizer: monotonic trend set to descending.
2024-04-03 19:44:54,898 | INFO : Optimizer: build model...
2024-04-03 19:44:54,900 | INFO : Optimizer: solve...
2024-04-03 19:44:54,906 | INFO : Optimizer terminated. Time: 0.0098s
2024-04-03 19:44:54,906 | INFO : Post-processing started.
2024-04-03 19:44:54,907 | INFO : Post-processing: compute binning information.
2024-04-03 19:44:54,916 | INFO : Post-processing terminated. Time: 0.0089s
2024-04-03 19:44:54,917 | INFO : Optimal binning terminated. Status: OPTIMAL. Time: 0.1375s
[3. 4. 5.]

Any help and clarification is greatly appreciated, thanks!

guillermo-navas-palencia commented 3 months ago

Hi @lx0531,

Thanks for reporting this issue. I noticed that the default value of subsample has changed in recent releases of sklearn.preprocessing.KBinsDiscretizer.


Until I release a new version that sets subsample=None by default, you can pass subsample=None as an extra keyword argument (it is forwarded through prebinning_kwargs), or use "cart" as the prebinning method. I hope this helps.

lx0531 commented 3 months ago

Hi @guillermo-navas-palencia,

I added subsample=None as an additional argument and the issue is solved. I really appreciate your help!