guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

Error: Fixed user_splits are removed because produce pure prebins #285

Closed ben-herbst closed 10 months ago

ben-herbst commented 10 months ago

I need to fix the bins for certain categorical values. Your example in the tutorial is the following:

user_splits = np.array([
               ['Businessman'],
               ['Working'],
               ['Commercial associate'],
               ['Pensioner', 'Maternity leave'],
               ['State servant'],
               ['Unemployed', 'Student']], dtype=object)
optb = OptimalBinning(name=variable_cat, dtype="categorical", solver="cp",
                      user_splits=user_splits,
                      user_splits_fixed=[False, True, True, True, True, True])

optb.fit(x_cat, y_cat)

Then

optb.splits

gives

[['Businessman', 'Pensioner', 'Maternity leave'],
 ['State servant'],
 ['Commercial associate'],
 ['Working'],
 ['Unemployed', 'Student']]

This is not what I need: the bins must remain exactly as specified. If I instead mark every split as fixed:

optb = OptimalBinning(name=variable_cat, dtype="categorical", solver="cp",
                      user_splits=user_splits,
                      user_splits_fixed=[True, True, True, True, True, True])

optb.fit(x_cat, y_cat)

I get the error:

ValueError: Fixed user_splits [list(['Businessman'])] are removed because produce pure prebins. Provide different splits to be fixed.

I get the same error if I try

user_splits = np.array([
               ['Businessman'],
               ['Working'],
               ['Commercial associate'],
               ['Pensioner', 'Maternity leave'],
               ['State servant'],
               ['Unemployed'], 
               ['Student']], dtype=object)
optb = OptimalBinning(name=variable_cat, dtype="categorical", solver="cp",
                      user_splits=user_splits,
                      user_splits_fixed=[False, True, True, True, True, True, True])

optb.fit(x_cat, y_cat)

but the following seemingly similar change does not give the error

user_splits = np.array([
               ['Businessman'],
               ['Working'],
               ['Commercial associate'],
               ['Pensioner'], 
               ['Maternity leave'],
               ['State servant'],
               ['Unemployed', 'Student']], dtype=object)
optb = OptimalBinning(name=variable_cat, dtype="categorical", solver="cp",
                      user_splits=user_splits,
                      user_splits_fixed=[False, True, True, True, True, True, True])

optb.fit(x_cat, y_cat)

guillermo-navas-palencia commented 10 months ago

Hi @ben-herbst.

The reason is simple: the categories "Businessman" and "Student" are both pure bins (in this case, their average target is zero), so a fixed split that isolates either of them produces a pure prebin. In the latter example, the first bin ["Businessman"] is not fixed and "Student" is merged with "Unemployed", so the error is not raised.
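To see in advance which user-defined bins would be pure, one option is to count events and non-events per bin before calling `fit`. The sketch below is only an illustration: `bin_purity` is a hypothetical helper, not part of optbinning, the toy data is made up, and it assumes a binary target coded as 0/1. Bins with no records are also reported as pure here, which is an assumption of this sketch.

```python
from collections import defaultdict


def bin_purity(x, y, user_splits):
    """Hypothetical helper (not an optbinning API).

    Returns {bin as tuple: (n_nonevent, n_event, is_pure)} for each
    user-defined bin, assuming y is binary and coded as 0/1.
    """
    # Count non-events (index 0) and events (index 1) per category.
    counts = defaultdict(lambda: [0, 0])
    for category, target in zip(x, y):
        counts[category][int(target)] += 1

    report = {}
    for categories in user_splits:
        n_nonevent = sum(counts[c][0] for c in categories)
        n_event = sum(counts[c][1] for c in categories)
        # Pure: the bin holds only one target class (or no records at all).
        is_pure = n_nonevent == 0 or n_event == 0
        report[tuple(categories)] = (n_nonevent, n_event, is_pure)
    return report


# Made-up toy data, for illustration only.
x = ["Working", "Working", "Student", "Student",
     "Unemployed", "Unemployed", "Businessman"]
y = [0, 1, 0, 0, 0, 1, 0]
splits = [["Businessman"], ["Working"], ["Student"], ["Unemployed", "Student"]]

report = bin_purity(x, y, splits)
for categories, (n_nonevent, n_event, is_pure) in report.items():
    print(categories, n_nonevent, n_event, "pure" if is_pure else "mixed")
```

In this toy data, "Student" is pure on its own but becomes mixed once merged with "Unemployed", which mirrors the behaviour described above. Any bin flagged as pure is a candidate for being left unfixed or merged with a neighbouring category.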

ben-herbst commented 10 months ago

Thanks, much appreciated!
