ing-bank / skorecard

scikit-learn compatible tools for building credit risk acceptance models
https://ing-bank.github.io/skorecard/
MIT License

[Bug] IV calculations are incorrect when the target is a pd.Series with non-standard index #109

Open dlaprins opened 1 year ago

dlaprins commented 1 year ago

When computing the information value using the iv function in metrics (line 338), the resulting values are wrong if the variable y is a pd.Series with an index that is not the canonical one.

The cause of this lies in the function woe_1d (line 6). Specifically, in line 36, the concatenation goes wrong since the indices will not be identical (the index of X is reset, the index of y is set to the canonical one only for np.arrays).

The error is easily reproducible by setting X to any categorical feature, and y to 1) a boolean numpy array, and 2) the same array as a pd.Series with non-canonical index. The results will differ.
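
A minimal sketch of the underlying mechanism, using pandas only and not calling skorecard itself. The names X, y_array and y_series and the index values 10..13 are made up for illustration. It shows how a column-wise pd.concat aligns on index labels, so a target with a non-canonical index no longer lines up with a feature whose index was reset:

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed, not taken from the issue).
X = pd.Series(["a", "b", "a", "b"], name="feature")  # canonical index 0..3
y_array = np.array([True, False, True, False])
y_series = pd.Series(y_array, index=[10, 11, 12, 13], name="target")

# Case 1: the numpy array is wrapped in a fresh Series, so both indices are 0..3.
aligned = pd.concat([X, pd.Series(y_array, name="target")], axis=1)

# Case 2: the Series keeps its own index, so concat takes the union of the
# two indices and pads the non-matching rows with NaN.
misaligned = pd.concat([X, y_series], axis=1)

print(aligned.shape)     # rows pair up one-to-one
print(misaligned.shape)  # rows are doubled, each with one NaN cell
print(misaligned)
```

Any per-bin counts of events and non-events computed on a frame like the second one would be based on rows where either the feature or the target is missing, which would be consistent with the differing IV values described above.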

Note that the docstrings of iv, _IV_scorer and woe_1d are also inconsistent as to whether the input should be np.arrays, pd.Series, or whether both are possible. Also, woe_1d returns a pd.DataFrame rather than a dictionary.

Although fixing the bug in woe_1d requires exactly 2 lines of code change ( "else: y = y.reset_index(drop=True)" inserted on line 27), my suggestion would be to heed #89 and properly address either rewriting or discarding woe_1d in favor of using category_encoders WOEEncoder.
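
For illustration, the idea behind that two-line change can be sketched outside skorecard as follows. The helper name to_canonical_series is hypothetical and is not part of the library. It only shows the intent: reset the index when y is already a pd.Series, and wrap it in a fresh Series otherwise.

```python
import numpy as np
import pandas as pd


def to_canonical_series(values, name):
    """Hypothetical helper: return `values` as a Series indexed 0..n-1."""
    if isinstance(values, pd.Series):
        # Drop whatever index the caller passed in.
        return values.reset_index(drop=True).rename(name)
    # Arrays and lists get a fresh default index.
    return pd.Series(np.asarray(values), name=name)


X = pd.Series(["a", "b", "a", "b"], name="feature")
y_array = np.array([True, False, True, False])
y_series = pd.Series(y_array, index=[10, 11, 12, 13])

from_array = pd.concat([X, to_canonical_series(y_array, "target")], axis=1)
from_series = pd.concat([X, to_canonical_series(y_series, "target")], axis=1)

# Both inputs now produce the same frame, so downstream counts agree.
print(from_array.equals(from_series))
```

With both inputs normalised this way, the concatenated frames no longer depend on the index the caller happened to pass.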

ReinierKoops commented 1 year ago

The latest commit offers an easy fix. A long-term fix will need to be addressed at some point.