ing-bank / skorecard

scikit-learn compatible tools for building credit risk acceptance models
https://ing-bank.github.io/skorecard/
MIT License

Unexpected behaviour missing value treatment "most_risky" / "least_risky" #117

Open dlaprins opened 5 months ago

dlaprins commented 5 months ago

When using a BucketingProcess, the treatment of missing values is determined by specifying missing_treatment for both the prebucketer and the bucketer. Let's consider using OptimalBucketer as the bucketer.

The desired functionality is for BucketingProcess to place missing values in the most risky bucket. This is currently not possible: even when missing_treatment = "most_risky" is set for both the prebucketer and OptimalBucketer, the BucketingProcess as a whole is not guaranteed to place missing values in the most risky bucket.

Consider the following situation:

Then what can happen is the following:

It sounds a bit hypothetical, but it has actually occurred for me on two separate occasions now. It is both unintuitive and undesirable.

Suggested solution: add a missing_treatment parameter to BucketingProcess which allows missing values to be reassigned after the prebucketer and bucketer have been applied.
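
To make the suggestion concrete, here is a rough sketch of what such a post-hoc reassignment could look like. It is plain Python and does not use skorecard's API. The function name `reassign_missing_to_most_risky` and its inputs are hypothetical, and it assumes "most risky" means the final bucket with the highest default rate among non-missing rows.

```python
from collections import defaultdict


def reassign_missing_to_most_risky(values, buckets, targets):
    """Hypothetical helper, not part of skorecard.

    values:  raw feature values, with None marking a missing value
    buckets: final bucket label per row, after prebucketer and bucketer
    targets: 1 for a default, 0 otherwise
    """
    totals = defaultdict(int)
    bads = defaultdict(int)

    # Default rate per final bucket, computed on non-missing rows only,
    # so the current placement of missing values does not bias the ranking.
    for value, bucket, target in zip(values, buckets, targets):
        if value is not None:
            totals[bucket] += 1
            bads[bucket] += target

    if not totals:
        # Nothing to rank against: leave the assignment unchanged.
        return list(buckets)

    most_risky = max(totals, key=lambda b: bads[b] / totals[b])

    # Move every missing row into the most risky bucket.
    return [most_risky if v is None else b for v, b in zip(values, buckets)]


# Toy data: the two missing rows ended up in bucket 0,
# while bucket 1 has the higher default rate.
values = [1.0, 2.0, None, 8.0, 9.0, None]
buckets = [0, 0, 0, 1, 1, 0]
targets = [0, 0, 1, 1, 1, 0]
fixed = reassign_missing_to_most_risky(values, buckets, targets)
```

In this toy data the missing rows move from bucket 0 to bucket 1. Inside BucketingProcess the same step would presumably run at the end of fit, once the final bucket boundaries are known.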