min_bin_size and max_bin_size not working when using sample_weight in ContinuousOptimalBinning

guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.

Apache License 2.0

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'value': {0: 0.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 5.0,
  6: 6.0,
  7: 7.0,
  8: 8.0,
  9: 9.0},
'target': {0: 7.747250464922968,
  1: 6.527567693419396,
  2: 5.951775031334447,
  3: 5.4739748791420855,
  4: 5.635028933057227,
  5: 5.177333709759795,
  6: 5.242660923463983,
  7: 4.681195578721209,
  8: 4.921130922493046,
  9: 4.698432205030768},
'num_obs': {0: 166252,
  1: 305567,
  2: 245220,
  3: 182303,
  4: 137543,
  5: 113468,
  6: 99369,
  7: 92211,
  8: 87613,
  9: 76431},
})

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.1,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'],
         df['target'],
         sample_weight=df['num_obs']
        )

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()

Hi @guillermo-navas-palencia, I was wondering if you could look into this. Here is a more comprehensive example:

The baseline (no sample weights) works fine and finds the bins:

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'value': {0: 0.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 5.0,
  6: 6.0,
  7: 7.0,
  8: 8.0,
  9: 9.0},
'target': {0: 7.747250464922968,
  1: 6.527567693419396,
  2: 5.951775031334447,
  3: 5.4739748791420855,
  4: 5.635028933057227,
  5: 5.177333709759795,
  6: 5.242660923463983,
  7: 4.681195578721209,
  8: 4.921130922493046,
  9: 4.698432205030768},                   
})

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
        )

print(optb.status)

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()

Adding a weight of 10 to each observation makes the problem infeasible, which is odd, since a uniform weight is just a rescaling of the unweighted case:

df['num_obs'] = [10] * 10

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
         sample_weight=df['num_obs']
        )

print(optb.status)
binning_table = optb.binning_table
binning_table.build()
binning_table.plot()

Repeating observations 10 times instead of using weights works fine:

df = df.loc[df.index.repeat(10)]

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target']
        )

print(optb.status)

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()
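
To make the expected equivalence concrete, here is a minimal sketch that uses only pandas and NumPy, not optbinning. It shows that per-bin record shares and mean targets are identical for the unweighted data, the data with a uniform weight of 10, and the data repeated 10 times. The bin edges and the `bin_summary` helper are hypothetical and chosen only for illustration, and the targets are rounded from the values above. This says nothing about how optbinning applies `min_bin_size` internally.

```python
import numpy as np
import pandas as pd

# Same shape as the example above; targets rounded for brevity.
df = pd.DataFrame({
    "value": np.arange(10, dtype=float),
    "target": [7.75, 6.53, 5.95, 5.47, 5.64, 5.18, 5.24, 4.68, 4.92, 4.70],
})

# Hypothetical candidate bins, not the ones optbinning would choose.
edges = [-np.inf, 2.5, 5.5, np.inf]


def bin_summary(values, target, w):
    """Weighted share of records and weighted mean target per bin."""
    frame = pd.DataFrame({
        "bin": pd.cut(values, edges),
        "w": w,
        "wt": w * target,
    })
    grouped = frame.groupby("bin", observed=True)[["w", "wt"]].sum()
    return pd.DataFrame({
        "share": grouped["w"] / grouped["w"].sum(),
        "mean": grouped["wt"] / grouped["w"],
    })


# 1) No weights.
unweighted = bin_summary(df["value"], df["target"], np.ones(len(df)))

# 2) Uniform weight of 10 per observation.
weighted = bin_summary(df["value"], df["target"], np.full(len(df), 10.0))

# 3) Each observation repeated 10 times, no weights.
rep = df.loc[df.index.repeat(10)].reset_index(drop=True)
repeated = bin_summary(rep["value"], rep["target"], np.ones(len(rep)))

print(unweighted)
print(np.allclose(unweighted.to_numpy(), weighted.to_numpy()))
print(np.allclose(weighted.to_numpy(), repeated.to_numpy()))
```

Since all three summaries agree, a bin-size constraint expressed as a fraction of records should be satisfied by the same bins in all three cases, which is why the infeasible status with uniform weights looks like a bug.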

Thanks for your work on this great tool!