guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

BinningProcess Behavior Mismatch with OptimalBinning for same Settings #313

Open jnsofini opened 2 months ago

jnsofini commented 2 months ago

Description: I encountered an issue while implementing a setup involving 1D OptimalBinning and BinningProcess. Specifically, I observed the following behavior:

When I use 1D OptimalBinning, it successfully provides the desired binning. However, when I copy the same settings and apply them in a BinningProcess, I encounter complaints about “pure bins.” My initial understanding was that BinningProcess internally utilizes OptimalBinning. However, the discrepancy between the two processes has left me puzzled.

Is there a specific setting I might be overlooking, or could there be inherent randomness affecting the results?

Here are the settings:

from optbinning import OptimalBinning

# Mark both user-provided split points as fixed.
user_splits = [210_000, 375_000]
user_splits_fixed = [True, True]
optb = OptimalBinning(
    name=variable,
    dtype="numerical",
    user_splits=user_splits,
    user_splits_fixed=user_splits_fixed,
    min_prebin_size=10e-5,  # note: 10e-5 equals 1e-4
    # max_n_bins=5,
    monotonic_trend="auto_asc_desc",
    special_codes=[-9]
)
optb.fit(X_train[variable].values, y_train)
optb.binning_table.build()

When I copy the same settings into a BinningProcess as follows:

from optbinning import BinningProcess

binning_process2 = BinningProcess(
    # categorical_variables=list_categorical,
    variable_names=[variable],
    # special_codes=special_codes,
    binning_fit_params={
        "B1_CUST_EXPOSURE_AMT": {
            "dtype": "numerical",
            "user_splits": [210_000, 375_000],
            "user_splits_fixed": [True, True],
            "special_codes": [-9],
            "min_prebin_size": 10e-5,
            # "max_n_bins": 5,
            "monotonic_trend": "auto_asc_desc",
        }
    },
)
# Note: this fit uses X_train_transformed, while the OptimalBinning fit above used X_train.
binning_process2.fit(X_train_transformed[[variable]], y_train)

I get the error

ValueError: Fixed user_splits [375000] are removed because produce pure prebins. Provide different splits to be fixed.
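
For context, a "pure" prebin is a bin whose records all belong to a single class (only events or only non-events). The snippet below is a minimal, library-independent sketch for checking whether a set of split points would yield such bins on a given input. The helper names `bin_class_counts` and `pure_bins` are hypothetical and not part of optbinning, the toy data is invented for illustration, and the half-open `[lower, upper)` bin convention is an assumption.

```python
from bisect import bisect_right


def bin_class_counts(x, y, splits, special_codes=()):
    """Return [n_nonevent, n_event] per bin for a binary target y in {0, 1}.

    Hypothetical helper, not an optbinning function.
    Assumes bins are half-open intervals [lower, upper).
    """
    splits = sorted(splits)
    counts = [[0, 0] for _ in range(len(splits) + 1)]
    for value, target in zip(x, y):
        # Skip missing values (None or NaN) and special codes.
        if value is None or value != value or value in special_codes:
            continue
        idx = bisect_right(splits, value)  # index of the bin containing value
        counts[idx][int(target)] += 1
    return counts


def pure_bins(counts):
    """Return indices of bins that contain only one class."""
    return [
        i
        for i, (n_nonevent, n_event) in enumerate(counts)
        if n_nonevent == 0 or n_event == 0
    ]


# Invented toy data: every record at or above 375_000 is an event.
x = [100_000, 150_000, 250_000, 300_000, 400_000, 500_000, -9]
y = [0, 1, 0, 1, 1, 1, 0]

counts = bin_class_counts(x, y, [210_000, 375_000], special_codes=[-9])
print(counts)             # [[1, 1], [1, 1], [0, 2]]
print(pure_bins(counts))  # [2]
```

Running the same check on each of the two inputs used above (`X_train[variable]` and `X_train_transformed[variable]`) would show whether the bin at or above 375000 contains a single class in one input but not in the other.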

guillermo-navas-palencia commented 2 months ago

Hi @jnsofini. Could you please provide a dataset to reproduce this behaviour? Thanks!