guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

quantile method fails for integer valued X #301

Closed dmitry-lesnik closed 7 months ago

dmitry-lesnik commented 7 months ago

If X is integer valued, method "quantile" fails to identify the upper bin.

In the code below there should be 3 perfect bins, corresponding to the X values 0, 1 and 2. However, the method merges bins "1" and "2". Making X float and shifting one of the largest values by 1e-7 fixes the issue (but this is a hack, not a solution).

import numpy as np
from optbinning import OptimalBinning


def test_opt_binning_with_integers():
    np.random.seed(666)
    N = 1000
    # Integer-valued feature (0, 1 or 2), stored as float
    X = np.random.randint(0, 3, N).astype(float)
    # Binary target whose event rate increases with X
    y = np.random.uniform(0, 1, N) + 0.2 * X
    y = (y > y.mean()).astype(int)

    # adding eps > 1e-8 to one of the points solves the problem
    # X[X.argmax()] += 1e-7

    optb = OptimalBinning(name="X", dtype="numerical", prebinning_method="quantile", max_n_bins=3)
    optb.fit(X, y)
    binning_table = optb.binning_table.build(show_digits=3)
    print(binning_table)
    # Expected: one split between each pair of distinct values
    assert (optb.splits == [1.0, 2.0]).all()

guillermo-navas-palencia commented 7 months ago

Hi @dmitry-lesnik.

First, this seems to be a scikit-learn issue (KBinsDiscretizer is used under the hood).

Second, the default prebinning_method is more robust and convenient, and I recommend only using quantile when the number of distinct values in x is large.
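
To illustrate the first point, here is a minimal pure-Python sketch of how quantile-based edges can collapse when X has only a few distinct values. It is not scikit-learn's actual code: the helper names (linear_quantile, quantile_edges, assign_bin) are hypothetical, the data is a fixed toy sample rather than the seeded one from the issue, and the 1e-8 minimum bin width is an assumption taken from the "eps > 1e-8" comment in the report.

```python
from bisect import bisect_right


def linear_quantile(sorted_vals, q):
    """Quantile of pre-sorted data, linearly interpolating between neighbours."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def quantile_edges(values, n_bins, min_width=1e-8):
    """Return (raw, kept) quantile edges; edges closing a bin narrower than
    min_width are dropped (assumed behaviour, see the lead-in)."""
    vals = sorted(values)
    raw = [linear_quantile(vals, k / n_bins) for k in range(n_bins + 1)]
    kept = [raw[0]] + [b for a, b in zip(raw, raw[1:]) if b - a > min_width]
    return raw, kept


def assign_bin(value, edges):
    """Index of the bin containing value, using only the interior edges."""
    return bisect_right(edges[1:-1], value)


# Toy integer-valued feature: 300 zeros, 300 ones, 400 twos.
x = [0.0] * 300 + [1.0] * 300 + [2.0] * 400
raw, edges = quantile_edges(x, n_bins=3)
print("raw edges: ", raw)
print("kept edges:", edges)
print("bins for 0, 1, 2:", [assign_bin(v, edges) for v in (0.0, 1.0, 2.0)])

# The workaround from the issue: nudge one of the largest values by 1e-7.
x_shifted = x[:-1] + [x[-1] + 1e-7]
raw_shifted, edges_shifted = quantile_edges(x_shifted, n_bins=3)
print("kept edges after shift:", edges_shifted)
print("bins for 0, 1, 2:", [assign_bin(v, edges_shifted) for v in (0.0, 1.0, 2.0)])
```

In this sketch the upper quantile coincides with the maximum, so the last bin has zero width and is dropped, which leaves the values 1 and 2 in the same bin. Shifting a single point by more than the assumed threshold keeps that edge, which is consistent with the workaround described in the report.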