guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0
452 stars, 100 forks

Randomness in the binning : Getting Different Bins each time #314

Open priyankamishra31 opened 5 months ago

priyankamishra31 commented 5 months ago

This issue is linked to https://github.com/guillermo-navas-palencia/optbinning/issues/299 (sorry, I couldn't find an option to reopen that issue, probably because I'm not a collaborator).

Hi @guillermo-navas-palencia,

I'm using optbinning.BinningProcess() for automatic binning of around 100-200 features, and I have noticed that the bins obtained for some variables differ on each run. It doesn't affect every variable, but it affects enough of them to be a concern. The binning is not deterministic even when the dataset is the same. (I initially thought the issue could be with the dataset, but when I ran the same cell in my Jupyter notebook twice, I got different bins for some features.)

The dataset is from Kaggle and is linked below. https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data?select=train.csv

I tried to replicate the issue and got a reproducible example. (I'm sharing the code file and the CSV exports of the results by email, since I don't see an option to attach them here.)

Binning Process:

    binning_process = BinningProcess(
        variable_names=variable_names,
        categorical_variables=categorical_variables,
        min_prebin_size=0.01,
        **binning_fit_params[0],
    )
    binning_process.fit(X_train, y_train, w_train)

These are the binning results from running the binning process 3 times, without changing anything:

For example, if you compare the files _binningresult.csv and _binning_result2.csv, you'll see the difference in bins for var_14 and var_15.

image

Similarly, on comparing the 3 files, I got the following differences:

image
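A comparison like the one in the screenshots can also be scripted. The following is a minimal sketch, assuming the split points of each run have already been collected into a dict keyed by variable name (for example from each fitted variable's splits, or from the exported CSV files). The helper name `differing_variables` and the split values are hypothetical and only illustrate the idea.

```python
import math


def differing_variables(splits_a, splits_b, tol=1e-9):
    """Return the variables whose split points differ between two runs.

    splits_a and splits_b map a variable name to its list of split points.
    """
    changed = []
    for name in sorted(set(splits_a) | set(splits_b)):
        a = splits_a.get(name, [])
        b = splits_b.get(name, [])
        same = len(a) == len(b) and all(
            math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)
        )
        if not same:
            changed.append(name)
    return changed


# Hypothetical split points from two runs on the same data.
run_1 = {"var_14": [5.2, 7.9], "var_15": [13.1, 14.6], "var_16": [9.0]}
run_2 = {"var_14": [5.2, 8.3], "var_15": [13.4, 14.6], "var_16": [9.0]}

changed = differing_variables(run_1, run_2)
print(changed)
```

With these made-up values, only var_14 and var_15 are reported, since var_16 has identical splits in both runs.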

I'm using optbinning==0.18.0

Can we prevent this from happening and make sure we get the same bins each time?

I hope this helps. I'm also sharing the Jupyter notebook (with output cells) by email for more context. Thanks for your help with this.

Thanks!

guillermo-navas-palencia commented 5 months ago

See: https://github.com/guillermo-navas-palencia/optbinning/issues/310#issuecomment-2036601399

priyankamishra31 commented 5 months ago

Hi @guillermo-navas-palencia, I'm using the 'cart' prebinning method (the one you suggested in that comment). I thought the subsample default value issue applied only to sklearn.preprocessing.KBinsDiscretizer. Does it affect the 'cart' method too?

I specified 'cart' in the binning_fit_params parameter of BinningProcess().

image

Thanks :-)

guillermo-navas-palencia commented 5 months ago

Hi @priyankamishra31. I was able to replicate this behaviour, thanks for providing the dataset. Findings:

image

However, in terms of IV, the CP-SAT solver returns the same value, i.e., the difference is below the solver's tolerance of 1e-6, so in that sense both solutions are equally valid. In other words, there are multiple optimal solutions. This was with ortools version 9.9.3963 (the latest version).

image
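To make the "multiple optimal solutions" point concrete, here is a minimal sketch with hypothetical counts (not taken from the Santander data). It uses the standard IV definition, the sum over bins of (p_nonevent - p_event) * ln(p_nonevent / p_event). Three prebins are merged in two different ways, and both binnings end up with the same IV, so a solver has no reason to prefer one over the other.

```python
import math


def information_value(bins):
    """Compute IV from a list of (events, nonevents) counts, one pair per bin."""
    total_events = sum(e for e, _ in bins)
    total_nonevents = sum(n for _, n in bins)
    iv = 0.0
    for events, nonevents in bins:
        p_event = events / total_events
        p_nonevent = nonevents / total_nonevents
        iv += (p_nonevent - p_event) * math.log(p_nonevent / p_event)
    return iv


# Hypothetical prebins, ordered by x: (events, nonevents).
g1, g2, g3 = (5, 45), (10, 40), (5, 45)


def merge(a, b):
    """Merge two adjacent prebins by adding their counts."""
    return (a[0] + b[0], a[1] + b[1])


binning_a = [merge(g1, g2), g3]  # split placed after the second prebin
binning_b = [g1, merge(g2, g3)]  # split placed after the first prebin

iv_a = information_value(binning_a)
iv_b = information_value(binning_b)

# Different split points, same objective value within the 1e-6 tolerance.
print(abs(iv_a - iv_b) < 1e-6)
```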

I understand that, from a modelling perspective, this is an issue. I will fix the random_seed parameter to enforce reproducibility. Lastly, it is worth noting that the Google OR-Tools team does not guarantee the same solution across versions.

priyankamishra31 commented 5 months ago

Thanks @guillermo-navas-palencia, I really appreciate you looking into this.

Is there anything I could do on my side (a workaround) if I still want to use the 'cp' solver and get consistent results? This would help me until the next version of the package is released.

Thanks again :-)

guillermo-navas-palencia commented 5 months ago

Unfortunately, I don't think so. If your target is binary and you increase min_prebin_size a bit, the MIP solver should be only slightly slower than CP. In general, keeping a reasonable min_prebin_size (e.g., 0.025 to 0.05) will reduce the number of equally optimal solutions. If I find the time, I will also experiment with other MIP solvers already supported by ortools (HiGHS and SCIP).
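A minimal sketch of that suggestion, assuming binning_fit_params is passed as a dict mapping each variable name to its OptimalBinning options and that "solver": "mip" is accepted there. The helper names `build_fit_params` and `make_binning_process` are hypothetical, and the default min_prebin_size of 0.05 is simply the upper end of the range mentioned above. optbinning is imported lazily so the dict-building part can be checked on its own.

```python
def build_fit_params(variable_names, solver="mip"):
    """Build per-variable fit options that request the given solver."""
    return {name: {"solver": solver} for name in variable_names}


def make_binning_process(variable_names, categorical_variables=None,
                         min_prebin_size=0.05):
    """Create a BinningProcess that uses the MIP solver for every variable.

    Assumption: per-variable options are forwarded to each variable's
    optimal binning object, so "solver" can be set through them.
    """
    from optbinning import BinningProcess  # imported lazily on purpose

    return BinningProcess(
        variable_names=variable_names,
        categorical_variables=categorical_variables,
        min_prebin_size=min_prebin_size,
        binning_fit_params=build_fit_params(variable_names),
    )


# Hypothetical variable names, as in the Santander dataset.
fit_params = build_fit_params(["var_14", "var_15"])
print(fit_params)
```

Usage would then be `make_binning_process(variable_names).fit(X_train, y_train)`. Whether this removes the run-to-run differences for a given dataset still needs to be checked, for instance by fitting twice and comparing the splits.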

Another comment about the CP solver: please bear in mind that it works with integer values, so optbinning scales the values by 1e6 and then rounds them to integers, which introduces rounding errors if the x values are tiny. For reference: https://github.com/guillermo-navas-palencia/optbinning/blob/master/optbinning/binning/cp.py#L53
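The effect of scaling and rounding can be illustrated with plain Python. This is only a sketch of the arithmetic described above, not the code in cp.py. The function names and sample values are hypothetical.

```python
SCALE = 1e6  # scaling factor mentioned in the comment above


def scale_and_round(x, scale=SCALE):
    """Scale a float and round it to the nearest integer."""
    return int(round(x * scale))


def relative_error(x, scale=SCALE):
    """Relative error introduced by the scale-then-round step."""
    return abs(scale_and_round(x, scale) / scale - x) / abs(x)


tiny = [1.2e-7, 3.4e-7]  # below the 1e-6 resolution
moderate = [0.25, 0.5]   # well above the resolution

# Both tiny values collapse to the same integer, so they become
# indistinguishable after the conversion.
print([scale_and_round(x) for x in tiny])

# Values well above 1e-6 keep their precision.
print([scale_and_round(x) for x in moderate])
print([relative_error(x) for x in tiny + moderate])
```

Any two values that differ by much less than 1e-6 can map to the same integer, which is one way distinct candidate solutions can look identical to the CP solver.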