guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

Transform calculates WoE=0 for special_codes #291

Closed combiz closed 5 months ago

combiz commented 6 months ago

Firstly, thank you for the fantastic package.

I've noticed a bug with special codes: when metric = "woe", the binner.transform function returns a WoE value of 0 for values that fall in a special-code bin, while all non-special-code bins are transformed correctly. This can go undetected because special codes are handled correctly everywhere else. For example, with metric = 'bins' the transformation returns the correct bins, including those for special codes; binner.binning_table.build() shows the bins with their correct (non-zero) WoE values; and binning_table.plot() shows the correct bin assignments and WoE values.

e.g. pd.DataFrame(binners["FEATURE"].transform(df["FEATURE"], metric = "woe")).value_counts()

 0.000000    56050   # <---- bug, should be non-zero
 0.758895    24398
 0.132909    19949
 0.411195    13739
 0.711351    10087
 0.546798    10014
-0.601100     8851
 0.333119     8540
Name: count, dtype: int64

This happens even though the rows that receive WoE=0 are correctly assigned to the special-code bin by the equivalent command pd.DataFrame(binners["FEATURE"].transform(df["FEATURE"], metric = "bins")).value_counts(), and even though the binning_table shows a non-zero WoE for this special code.

Apologies, I don't currently have the bandwidth for a full reprex, but hopefully this helps.

guillermo-navas-palencia commented 6 months ago

Hi @combiz.

The transform method has a couple of parameters you can have a look at: metric_special and metric_missing (see https://gnpalencia.org/optbinning/binning_binary.html#optbinning.OptimalBinning.transform). By default both are set to 0, which is why the special-code bin is transformed to WoE=0. To use the actual WoE values for special codes, just set metric_special="empirical". The default value is not "empirical" because that might produce an infinite IV if there are no special or missing values.
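
To illustrate the difference between the two settings, here is a minimal pure-Python sketch. It is not optbinning's implementation. It assumes the convention WoE = ln(share of non-events in the bin / share of events in the bin); the helper names empirical_woe and special_woe and all counts are hypothetical, chosen only for illustration. It also shows why an empty special bin has no finite empirical WoE.

```python
import math


def empirical_woe(n_nonevent, n_event, total_nonevent, total_event):
    """WoE of one bin, assuming WoE = ln(non-event share / event share).

    Returns None when the bin has no events or no non-events, because the
    ratio is then zero, infinite, or undefined.
    """
    if n_nonevent == 0 or n_event == 0:
        return None
    return math.log((n_nonevent / total_nonevent) / (n_event / total_event))


def special_woe(n_nonevent, n_event, total_nonevent, total_event,
                metric_special=0):
    """Value assigned to special-code rows (hypothetical helper).

    A numeric metric_special is used as-is (the default 0 reproduces the
    behaviour reported above); "empirical" uses the bin's own counts.
    """
    if metric_special == "empirical":
        return empirical_woe(n_nonevent, n_event, total_nonevent, total_event)
    return float(metric_special)


# Made-up counts: the special bin holds 5,000 non-events and 1,000 events
# out of 90,000 non-events and 10,000 events overall.
counts = dict(n_nonevent=5000, n_event=1000,
              total_nonevent=90000, total_event=10000)

default_value = special_woe(**counts)                                # 0.0
empirical_value = special_woe(**counts, metric_special="empirical")  # ln(5/9)

# With no special values in the data, the empirical WoE is not finite.
empty_value = special_woe(0, 0, 90000, 10000, metric_special="empirical")
```

In optbinning itself, the corresponding change is the one described above: pass metric_special="empirical" to transform alongside metric="woe".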

guillermo-navas-palencia commented 5 months ago

I'm closing this issue; please re-open it if the explanation was unclear.