guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0
435 stars, 98 forks

Missing target records all considered as 'Events' #234

Closed by chapmanh 1 year ago

chapmanh commented 1 year ago

Hi, thanks again for providing this library. It's helped me immensely!

I wanted to raise something around null values when fitting data. It appears that fitting with records that have a missing target value places those records in the 'Missing' section, but automatically counts them as events. This can drastically skew the IVs.

Perhaps it would be clearer and more appropriate if only records with valid targets contributed to the WOE and IV calculations, and if a warning were raised to the user when null target records have been provided?
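The behaviour proposed above can be approximated on the caller's side today. The following is a minimal sketch in plain Python, not part of optbinning: the helper name `drop_missing_target` and the warning text are hypothetical, and it assumes a missing target is represented as `None` or a float NaN.

```python
import math
import warnings


def _is_missing(value):
    # A target counts as missing if it is None or a float NaN.
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing_target(x, y):
    """Return (x, y) restricted to records with a valid target.

    Hypothetical helper: emits a UserWarning when any record is dropped.
    """
    kept = [(xi, yi) for xi, yi in zip(x, y) if not _is_missing(yi)]
    n_dropped = len(y) - len(kept)
    if n_dropped:
        warnings.warn(
            f"Dropped {n_dropped} of {len(y)} records with a missing target.",
            UserWarning,
        )
    x_clean = [xi for xi, _ in kept]
    y_clean = [yi for _, yi in kept]
    return x_clean, y_clean
```

With a pandas DataFrame the same filter is the `df[target].notnull()` mask used in the example below; the cleaned `x` and `y` can then be passed to `fit` as usual.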

I've redacted my dataset, but here's a quick example of a metric with about 13% missing target values. For context, I have a dataset with variables for every record, but there are some instances where it is not appropriate to include a particular record.

from optbinning import OptimalBinning


def fit_bin(df, variable, target, special_codes, remove_null=True):
    # Optionally drop records whose target is missing before fitting.
    if remove_null:
        df = df.loc[df[target].notnull(), [variable, target]]

    x = df[variable]
    y = df[target]
    # Treat object columns as categorical, everything else as numerical.
    dtype = 'categorical' if x.dtype == 'object' else 'numerical'
    optb = OptimalBinning(name=variable, dtype=dtype, solver='cp',
                          special_codes=special_codes)

    optb.fit(x, y)
    return optb


# df, v, t, dd and get_special_values are defined elsewhere (not shown here).
optb = fit_bin(df, v, t, get_special_values(dd, v), remove_null=True)
optb.binning_table.build()

yields: image

whereas:

optb_nulls = fit_bin(df, v, t, get_special_values(dd, v), remove_null=False)
optb_nulls.binning_table.build()

yields: image

The IV is hugely inflated, which I believe is due to the skew in the distribution of events within each bin. Thanks, H
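To illustrate why counting null-target records as events can inflate the IV, here is a small worked sketch. It uses the common definitions WoE_i = ln(p_nonevent_i / p_event_i) and IV = sum_i (p_nonevent_i - p_event_i) * WoE_i. The bin counts are made up for illustration and are not taken from the redacted dataset; the last bin stands in for 'Missing', and the scenario assumes the null-target records land there and are counted as events, as described above.

```python
import math


def information_value(bins):
    """Compute IV from a list of (non_events, events) counts per bin."""
    total_non_events = sum(ne for ne, _ in bins)
    total_events = sum(e for _, e in bins)
    iv = 0.0
    for ne, e in bins:
        p_non_event = ne / total_non_events  # share of all non-events in this bin
        p_event = e / total_events           # share of all events in this bin
        woe = math.log(p_non_event / p_event)
        iv += (p_non_event - p_event) * woe
    return iv


# Hypothetical counts: three regular bins plus a 'Missing' bin (last entry).
clean_bins = [(300, 30), (250, 40), (200, 50), (40, 10)]

# Same data, but 120 null-target records are added to 'Missing' as events.
skewed_bins = [(300, 30), (250, 40), (200, 50), (40, 10 + 120)]

iv_clean = information_value(clean_bins)
iv_skewed = information_value(skewed_bins)
print(f"IV without null targets: {iv_clean:.3f}")
print(f"IV with null targets counted as events: {iv_skewed:.3f}")
```

The extra events change both the event share of the 'Missing' bin and the total event count, so every bin's WoE shifts and the 'Missing' bin alone contributes a large IV term.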

guillermo-navas-palencia commented 1 year ago

Hi @chapmanh.

The target cannot contain NaN. However, for performance reasons, this is only checked if you set check_input to True when fitting:

optb = OptimalBinning()
optb.fit(x, y, check_input=True)

If you pass NaN in the target, which arguably does not make sense, you get this undesired behavior.