guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0
435 stars 98 forks source link

BinningProcess: error in binning_transform_params parameter with metric = bins #266

Closed max-franceschi closed 7 months ago

max-franceschi commented 9 months ago

Hello, Thank you for your great work on this excellent module, really helpful.

I want to use BinningProcess to transform columns in a sklearn pipeline. However, I would like BinningProcess to return bins instead of the mean of the target variable to have meaningful bin names. To present my issue, I produce an example out of the sklearn pipeline.

My understanding of the BinningProcess documentation is that I can handle the binning output format:

  1. Either within the .transform method with the option metric = "bins".
df = pd.DataFrame({'continuous_feature': choices(range(0,30), k=95) + [np.NaN]*5,
                        'cat_feature': choices(['A', 'B', 'C'], k = 100),
                         'target' : [uniform(15,16) for x in range(0,100)]})

all_features = ["continuous_feature", "cat_feature"]
X = df.loc[:, all_features]
y = df.loc[:, 'target']

BinningProcess(all_features).fit_transform(X,y, metric = "bins")

this works fine and I obtain the desired table:

image

However, since I eventually want to use BinningProcess in a pipeline, I cannot use this .transform method option.

  1. Or within the BinningProcess function with the option binning_transform_params

The equivalent code should be:

df = pd.DataFrame({'continuous_feature': choices(range(0,30), k=95) + [np.NaN]*5,
                    'cat_feature': choices(['A', 'B', 'C'], k = 100),
                     'target' : [uniform(15,16) for x in range(0,100)]})
all_features = ["continuous_feature", "cat_feature"]

X = df.loc[:, all_features]
y = df.loc[:, 'target']

BinningProcess(all_features,
               binning_transform_params = {"continuous_feature": {"metric": "bins"},
                                           "cat_feature": {"metric": "bins"}}).fit_transform(X,y)

Unfortunately, this yields an error

ValueError: could not convert string to float: '(-inf, 4.50)'

What could I to to prevent this error?

Also, binning_transform_params work well if I use another option than "bins", e.g. "indices" {"continuous_feature": {"metric": "indices"}, "cat_feature": {"metric": "indices"}}: image

bmreiniger commented 9 months ago

The problem is that the transform arg is used to select the type of the initialized output:

https://github.com/guillermo-navas-palencia/optbinning/blob/3754085b0dfc461f97673ce9603ee6846bd988bd/optbinning/binning/binning_process.py#L1382-L1389

max-franceschi commented 9 months ago

Sorry, I'm not sure to understand. Could you detail a bit, please? Also, is there a way to circumvent this issue?

bmreiniger commented 9 months ago

I was just trying to help identify the problem in the code. I'm unfortunately not aware of a workaround, short of writing a custom wrapper class.

It could be fixed by a PR, which I'd be happy to make, except that I'm not sure what the most efficient approach might be. My first thought is to not initialize the return array as in that snippet at all, collect all the transformed columns, then concatenate them with np.c_. I'd have to check whether that works as expected with heterogeneous types.

max-franceschi commented 9 months ago

@bmreiniger Thank you for your ideas. I have never worked on a module's GitHub before, but let me know if I can help.

guillermo-navas-palencia commented 9 months ago

Hi. NumPy does not support mixed types. If you use column_stack, the resulting array will apply the type hierarchy. For example: image

Therefore, the PR is only valid when X is a pandas DataFrame.

guillermo-navas-palencia commented 9 months ago

The code could check if the individual variable transformation metrics are compatible. For example, if any is "bins" all must be "bins" otherwise raise an exception. If transforms are numeric (woe, mean, event_rate, indices)

bmreiniger commented 9 months ago

Hrm, I had assumed it would've escalated to object type for the array but left the actual values as strings and floats. I guess it's unlikely that someone will want to transform columns with different metrics, so maybe an exception is the best/easiest path forward. The message could suggest a ColumnTransformer...er, does that suffer the same type issue, or how do they solve it?

bmreiniger commented 9 months ago

Maybe check the individual variable transformation metrics, but instead of erroring on mixed bins/other, initialize the array to be object type? It could hurt performance downstream, but the user has chosen this niche mixed-metric output.

max-franceschi commented 8 months ago

Hi, I just wanted to know if there were any updates regarding this PR. Otherwise, is there a way I could hot fix the issue locally?

guillermo-navas-palencia commented 8 months ago

Hi. Currently, I don't have the time to work on this change. The fix suggested by @bmreiniger should work.

bmreiniger commented 8 months ago

The added test in my PR is already operating on a numpy array, not a dataframe. It passes because the first column's binner produces an output that has object dtype (individual entries are strings), and then the column_stack works as I had originally expected.

https://github.com/guillermo-navas-palencia/optbinning/blob/3754085b0dfc461f97673ce9603ee6846bd988bd/optbinning/binning/transformations.py#L184

guillermo-navas-palencia commented 7 months ago

Fix in the develop branch. I close it. Expect a new release before the end of the year.