Randomness in the binning : Getting Different Bins each time

priyankamishra31 commented 10 months ago

Hi,

I'm using optbinning.BinningProcess() for automatic binning of around 100 features, and I have noticed that the bins obtained for some variables differ on each run. It doesn't affect all the bins, but the difference is still large enough to be a concern. The binning appears to be random even when the dataset is the same. (I initially thought the issue could be with the dataset, but when I ran the same cell in my Jupyter notebook twice, I got different bins for the features.)

Is this expected? Can we prevent it and make sure we get the same bins each time?

I would really appreciate your help with this.

Thanks!

guillermo-navas-palencia commented 10 months ago

Hi @priyankamishra31.

This is absolutely not expected. There is no source of uncertainty during the binning process. I suspect the error comes from another part of the code. Could you please provide a reproducible example?

priyankamishra31 commented 10 months ago

Hi @guillermo-navas-palencia,

Sorry for the delay in replying.

I can't share reproducible code with the same dataset, as the dataset I'm using is quite large. Instead, I'm sharing the file I'm using (attached to the email), in case that is helpful. I tried to reproduce the issue with multiple datasets but couldn't find one big enough (around 400-500 features) to reproduce it.

Currently, the binning differs for around 1-2% of the features. I've also observed that it's not really an issue when the number of features is smaller (e.g. 100-200).

Priyanka


guillermo-navas-palencia commented 8 months ago

I closed this issue since a reproducible example was not shared. Please reopen if this changes.

priyankamishra31 commented 7 months ago

Hi @guillermo-navas-palencia,

I tried to replicate the issue and now have a reproducible example. (I'm sharing the code file and the CSV of the exported results by email, since I can't attach them here.)

Binning Process:

    binning_process = BinningProcess(
        variable_names=variable_names,
        categorical_variables=categorical_variables,
        min_prebin_size=0.01,
        **binning_fit_params[0])
    binning_process.fit(X_train, y_train, w_train)

The dataset used was: https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data?select=train.csv

These are the binning results from running the Binning Process three times without changing anything:

For example, if you compare the files _binningresult.csv and _binning_result2.csv, you'll see the difference in bins for var_14 and var_15.
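A comparison like this could also be scripted instead of diffing exported files by hand. The following is only a minimal sketch: `collect_runs`, `unstable_variables`, and the `fit_once` callable are hypothetical helpers, not part of optbinning. It assumes each run can be reduced to a mapping from variable name to a list of numeric split points. With optbinning, `fit_once` might build a fresh `BinningProcess`, fit it, and read `get_binned_variable(name).splits` for each numerical variable, but that wiring is an assumption and is left out here.

```python
from typing import Callable, Dict, List, Sequence

# One run: variable name -> ordered list of numeric split points.
Splits = Dict[str, List[float]]


def collect_runs(fit_once: Callable[[], Splits], n_runs: int = 3) -> List[Splits]:
    """Call the (hypothetical) fit_once callable n_runs times and keep each result."""
    return [fit_once() for _ in range(n_runs)]


def unstable_variables(
    runs: Sequence[Splits], tol: float = 0.0
) -> Dict[str, List[List[float]]]:
    """Return the variables whose split points differ from the first run.

    The result maps each unstable variable to its split points in every run,
    so the differences can be inspected side by side.
    """
    unstable: Dict[str, List[List[float]]] = {}
    # Union of names, so a variable missing from some run is also reported.
    names = sorted(set().union(*runs))
    for name in names:
        variants = [list(run.get(name, [])) for run in runs]
        reference = variants[0]
        differs = any(
            len(v) != len(reference)
            or any(abs(a - b) > tol for a, b in zip(v, reference))
            for v in variants[1:]
        )
        if differs:
            unstable[name] = variants
    return unstable


if __name__ == "__main__":
    # Made-up split points for illustration only; not results from this issue.
    run_1 = {"feature_a": [5.1, 9.8], "feature_b": [14.2]}
    run_2 = {"feature_a": [5.1, 10.4], "feature_b": [14.2]}
    print(sorted(unstable_variables([run_1, run_2])))  # ['feature_a']
```

Reporting the split points from every run, rather than just a yes/no flag, makes it easier to judge whether a difference is a small numerical shift or a genuinely different binning.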

Similarly, on comparing the three files, I got the following differences:

I hope this helps. I'm also sharing the Jupyter notebook (with output cells) by email for more context. Thanks for your help with this.

priyankamishra31 commented 7 months ago

Hi,

As mentioned in the GitHub issue, I'm sharing the related Jupyter notebook for reference. Please let me know if you need any more information on the dataset or the process followed.

Regards, Priyanka

lcrmorin commented 6 months ago

@priyankamishra31 Can you share it publicly? This is a concern for me as well.

guillermo-navas-palencia commented 6 months ago

> @priyankamishra31 Can you share it publicly? This is a concern for me as well.

https://github.com/guillermo-navas-palencia/optbinning/issues/314#issuecomment-2092786722

guillermo-navas-palencia / optbinning

Randomness in the binning : Getting Different Bins each time #299