guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

Summary statistics could be incorrect when using sample weights #324

Open diegodebrito opened 4 months ago

diegodebrito commented 4 months ago

I created a simple dataframe with age, salary, and num_obs:

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'age': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
 'salary': {0: 0.7739560485559633,
  1: 0.4388784397520523,
  2: 0.8585979199113825,
  3: 0.6973680290593639,
  4: 0.09417734788764953,
  5: 0.9756223516367559,
  6: 0.761139701990353,
  7: 0.7860643052769538,
  8: 0.12811363267554587,
  9: 0.45038593789556713},
 'num_obs': {0: 5, 1: 4, 2: 3, 3: 7, 4: 6, 5: 6, 6: 5, 7: 7, 8: 5, 9: 5}})

Better displayed as:

age     salary  num_obs
  1   0.773956        5
  2   0.438878        4
  3   0.858598        3
  4   0.697368        7
  5  0.0941773        6
  6   0.975622        6
  7    0.76114        5
  8   0.786064        7
  9   0.128114        5
 10   0.450386        5

I then generated optimal bins using num_obs as sample weights:

optb = ContinuousOptimalBinning(dtype="numerical")
optb.fit(df['age'], df['salary'], sample_weight=df['num_obs'])

binning_table = optb.binning_table
binning_table.build()

Which results in:

        Bin           Count  Count (%)       Sum                  Std      Mean       Min       Max  Zeros count        WoE         IV
0       (-inf, 1.50)      5  0.0943396   3.86978   1.5479120971119267  0.773956   3.86978   3.86978            0   0.175803  0.0165852
1       [1.50, 4.50)     14   0.264151   9.21288    1.401113568517211  0.658063   1.75551   4.88158            0  0.0599101  0.0158253
2       [4.50, 8.50)     24    0.45283   15.7269   1.6960750859575562  0.655289  0.565064   5.85373            0  0.0571365  0.0258731
3       [8.50, 9.50)      5  0.0943396  0.640568  0.25622726535109175  0.128114  0.640568  0.640568            0  -0.470039  0.0443433
4       [9.50, inf)       5  0.0943396   2.25193   0.9007718757911344  0.450386   2.25193   2.25193            0  -0.147767  0.0139403
5       Special           0          0         0                  nan         0       nan       nan            0  -0.598153          0
6       Missing           0          0         0                  nan         0       nan       nan            0  -0.598153          0
Totals                   53          1   31.7021                       0.598153  0.565064   5.85373            0    2.10696   0.116567

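Notice that row 3 (bin [8.50, 9.50)) has a nonzero Std. Since the only age that falls in that bin is 9, I don't understand how the Std could be anything other than 0. Other statistics look odd as well: for example, Min and Max for that bin are both 0.640568, which is the salary (0.128114) multiplied by its weight (5) rather than the salary itself.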
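As a cross-check, here is a minimal pure-Python sketch of the per-bin statistics one might expect. It assumes num_obs acts as a frequency weight (each row stands for num_obs identical observations) and takes the split points from the Bin column above. The helper weighted_bin_stats is hypothetical and not part of optbinning, and this is not necessarily how the library defines these statistics.

```python
import math
from bisect import bisect_right

ages = list(range(1, 11))
salaries = [0.7739560485559633, 0.4388784397520523, 0.8585979199113825,
            0.6973680290593639, 0.09417734788764953, 0.9756223516367559,
            0.761139701990353, 0.7860643052769538, 0.12811363267554587,
            0.45038593789556713]
num_obs = [5, 4, 3, 7, 6, 6, 5, 7, 5, 5]

# Split points copied from the Bin column of the binning table.
splits = [1.5, 4.5, 8.5, 9.5]


def weighted_bin_stats(x, y, w, splits):
    """Hypothetical helper: per-bin stats of y, treating w as frequency weights."""
    groups = [[] for _ in range(len(splits) + 1)]
    for xi, yi, wi in zip(x, y, w):
        # bisect_right assigns xi to left-closed bins [a, b), as in the table.
        groups[bisect_right(splits, xi)].append((yi, wi))

    stats = []
    for rows in groups:
        count = sum(wi for _, wi in rows)
        if count == 0:
            stats.append(None)  # empty bin: nothing to summarise
            continue
        total = sum(yi * wi for yi, wi in rows)
        mean = total / count
        # Population variance under frequency weights (an assumption).
        var = sum(wi * (yi - mean) ** 2 for yi, wi in rows) / count
        stats.append({
            "count": count, "sum": total, "mean": mean, "std": math.sqrt(var),
            # Min and max are taken over the raw, unweighted target values.
            "min": min(yi for yi, _ in rows), "max": max(yi for yi, _ in rows),
        })
    return stats


stats = weighted_bin_stats(ages, salaries, num_obs, splits)
for s in stats:
    print(s)
```

Under these assumptions the weighted Count, Sum, and Mean agree with the table, but the bin [8.50, 9.50) contains a single distinct salary, so its standard deviation is zero up to floating-point rounding and its Min and Max are both 0.128114.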

Please let me know if there is an issue when using weights or if I'm understanding the results wrong.

Thanks!

PS: This might be related to issue #323: https://github.com/guillermo-navas-palencia/optbinning/issues/323