guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

min_bin_size and max_bin_size not working when using sample_weight in ContinuousOptimalBinning #323

Open diegodebrito opened 2 months ago

diegodebrito commented 2 months ago

The min_bin_size and max_bin_size parameters do not seem to be respected when sample_weight is passed to fit. The example below produces a single bin regardless of the values chosen for those parameters.

Removing sample_weight from the fit call makes the example behave as expected (comment out that argument and rerun the example below).

Please let me know if it's my lack of understanding or if I'm using the tool incorrectly.

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'value': {0: 0.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 5.0,
  6: 6.0,
  7: 7.0,
  8: 8.0,
  9: 9.0},
 'target': {0: 7.747250464922968,
  1: 6.527567693419396,
  2: 5.951775031334447,
  3: 5.4739748791420855,
  4: 5.635028933057227,
  5: 5.177333709759795,
  6: 5.242660923463983,
  7: 4.681195578721209,
  8: 4.921130922493046,
  9: 4.698432205030768},
 'num_obs': {0: 166252,
  1: 305567,
  2: 245220,
  3: 182303,
  4: 137543,
  5: 113468,
  6: 99369,
  7: 92211,
  8: 87613,
  9: 76431}})

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.1,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
         sample_weight=df['num_obs']
        )

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()
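
As a rough sanity check on the example above, the sketch below computes each bin's share of the total weight for one hand-picked split. It assumes that, with sample_weight, min_bin_size and max_bin_size are meant as fractions of the total weight; that interpretation is an assumption, not something confirmed by the library. The names weight_shares and candidate_bins are hypothetical and not part of optbinning.

```python
# Weights copied from the 'num_obs' column of the example above.
num_obs = [166252, 305567, 245220, 182303, 137543,
           113468, 99369, 92211, 87613, 76431]


def weight_shares(weights, bins):
    """Return each bin's share of the total weight.

    `bins` is a list of groups, each group a list of row positions.
    """
    total = sum(weights)
    return [sum(weights[i] for i in group) / total for group in bins]


# Hypothetical split: values 0-3 each get their own bin, 4-9 are merged.
candidate_bins = [[0], [1], [2], [3], [4, 5, 6, 7, 8, 9]]
shares = weight_shares(num_obs, candidate_bins)

print([round(s, 3) for s in shares])
# True if every bin satisfies min_bin_size=0.1 and max_bin_size=1.0
# under the assumed "fraction of total weight" interpretation.
print(all(0.1 <= s <= 1.0 for s in shares))
```

If that interpretation holds, at least one split with more than one bin satisfies the size constraints, so a single-bin result would not be forced by min_bin_size and max_bin_size alone.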
diegodebrito commented 3 days ago

Hi @guillermo-navas-palencia, wondering if you could check on this. I'm adding a more comprehensive example below:

The baseline, without weights, works fine and finds the bins:

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'value': {0: 0.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 5.0,
  6: 6.0,
  7: 7.0,
  8: 8.0,
  9: 9.0},
'target': {0: 7.747250464922968,
  1: 6.527567693419396,
  2: 5.951775031334447,
  3: 5.4739748791420855,
  4: 5.635028933057227,
  5: 5.177333709759795,
  6: 5.242660923463983,
  7: 4.681195578721209,
  8: 4.921130922493046,
  9: 4.698432205030768},                   
})

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
        )

print(optb.status)

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()

[image: binning table plot for the unweighted baseline]

Giving every observation a weight of 10 makes the problem infeasible, which is unexpected, since uniform weights should amount to a simple rescaling:

df['num_obs'] = [10] * 10

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
         sample_weight=df['num_obs']
        )

print(optb.status)
binning_table = optb.binning_table
binning_table.build()
binning_table.plot()
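
To spell out the "simple scaling" expectation, here is a minimal sketch in plain Python. It assumes bin sizes are measured as a fraction of the total weight, which may not match what the library does internally. The function bin_fractions and the split in bin_ids are hypothetical and only for illustration.

```python
def bin_fractions(weights, bin_ids):
    """Return each bin's share of the total weight, keyed by bin id."""
    totals = {}
    for w, b in zip(weights, bin_ids):
        totals[b] = totals.get(b, 0) + w
    grand_total = sum(weights)
    return {b: t / grand_total for b, t in totals.items()}


# Hypothetical assignment of the ten rows to three bins.
bin_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]

unweighted = bin_fractions([1] * 10, bin_ids)   # every row counts once
weighted = bin_fractions([10] * 10, bin_ids)    # every row has weight 10

# Uniform weights cancel out in the ratio, so the shares are identical.
print(unweighted == weighted)
# Under this assumption the split satisfies min_bin_size=0.3 in both cases.
print(all(share >= 0.3 for share in weighted.values()))
```

Under that assumption, a split that is feasible without weights stays feasible with a constant weight, which is why the infeasible status looks like a bug rather than a modelling issue.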

[image: binning table plot with sample_weight set to 10 for every row]

Repeating observations 10 times instead of using weights works fine:

# Repeat every row 10 times instead of passing sample_weight.
df = df.loc[df.index.repeat(10)]

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target']
        )

print(optb.status)

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()

[image: binning table plot with each row repeated 10 times]

Thanks for your work on this great tool!