Hi @priyankamishra31.
This is absolutely not expected. There is no source of uncertainty during the binning process. I suspect the error comes from another part of the code. Could you please provide a reproducible example?
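For reference, a minimal self-contained check along these lines could look like the sketch below. The sklearn dataset and parameter choices are illustrative assumptions, not the setup from this issue; the idea is simply to fit the same BinningProcess twice on identical data and compare the resulting split points.

```python
# Minimal determinism check (illustrative only: the sklearn dataset and
# the parameters below are assumptions, not the original setup).
from sklearn.datasets import load_breast_cancer
from optbinning import BinningProcess

data = load_breast_cancer()
X, y = data.data, data.target
variable_names = list(data.feature_names)

def fit_splits():
    # Fit a fresh BinningProcess and collect the split points per variable.
    bp = BinningProcess(variable_names=variable_names, min_prebin_size=0.01)
    bp.fit(X, y)
    return {name: bp.get_binned_variable(name).splits.tolist()
            for name in variable_names}

run1, run2 = fit_splits(), fit_splits()
print("identical bins across runs:", run1 == run2)
```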
Hi @guillermo-navas-palencia,
Sorry for the delay in replying.
I can't share reproducible code with the same dataset, as the dataset I'm using is quite large. But I'm sharing the file I'm using (attached in the email), in case that is helpful. I tried to reproduce the issue with multiple datasets, but couldn't find a large enough dataset (around 400-500 features) to reproduce it.
The difference in binning currently shows up in around 1-2% of the total features (one way to quantify this is sketched after this message). I've also observed that it's not really an issue when the number of features is smaller (e.g. 100-200).
Priyanka
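A sketch for quantifying that figure, assuming `bp1` and `bp2` are two independently fitted BinningProcess objects built with the same configuration on the same data (the names here are placeholders, not taken from the attached code):

```python
# Sketch: report which variables ended up with different split points
# between two fits. Assumes numerical variables, where .splits is a
# numeric array; bp1, bp2 and variable_names are placeholders.
import numpy as np

def differing_variables(bp1, bp2, variable_names):
    return [name for name in variable_names
            if not np.array_equal(bp1.get_binned_variable(name).splits,
                                  bp2.get_binned_variable(name).splits)]

# Example usage (objects come from your own runs):
# changed = differing_variables(bp1, bp2, variable_names)
# print(f"{len(changed)}/{len(variable_names)} variables differ:", changed)
```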
I closed since a reproducible example was not shared. Please reopen if this changes.
Hi @guillermo-navas-palencia,
I tried to replicate the issue and got a reproducible example. (I'm sharing the code file and the CSV of the exported results by email, since I can't attach them here.)
Binning Process:

```python
from optbinning import BinningProcess

binning_process = BinningProcess(variable_names=variable_names, categorical_variables=categorical_variables,
                                 min_prebin_size=0.01, **binning_fit_params[0])
binning_process.fit(X_train, y_train, w_train)
```
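For context, the contents of `binning_fit_params[0]` are not shown in the thread; purely as a hypothetical illustration, it could hold standard BinningProcess keyword arguments such as:

```python
# Hypothetical example only: the actual contents of binning_fit_params[0]
# are not shown in this thread. These are standard BinningProcess
# keyword arguments that could be passed this way.
binning_fit_params = [{
    "max_n_prebins": 30,
    "min_bin_size": 0.05,
    "special_codes": [-999],
}]
```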
The dataset used was: https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data?select=train.csv
And these are the binning results from running the BinningProcess 3 times, without changing anything:
For example, if you compare the files binning_result.csv and binning_result_2.csv, you'll see the difference in bins for var_14 and var_15.
Similarly, on comparing the 3 files, I found further differences of the same kind.
I hope this helps. I'm also sharing the Jupyter notebook (with output cells) by email for more context. Thanks for your help with this.
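A small pandas sketch for comparing two such exported files programmatically; the file names and the assumption of a shared variable-name column ("name") are guesses about the export format, not taken from the attachments:

```python
# Sketch: diff two exported binning-result CSVs. File names and the key
# column ("name") are assumptions about the export format.
import pandas as pd

r1 = pd.read_csv("binning_result.csv").set_index("name").sort_index()
r2 = pd.read_csv("binning_result_2.csv").set_index("name").sort_index()

# Cells that differ between the two runs, ignoring positions where both are NaN.
mismatch = (r1 != r2) & ~(r1.isna() & r2.isna())
changed = mismatch.any(axis=1)
print("variables with differing results:", r1.index[changed].tolist())
```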
Hi,
As mentioned in the GitHub issue, sharing the related Jupyter Notebook for reference. Please let me know if you need any more information on the dataset or the process followed.
Regards, Priyanka
@priyankamishra31 Can you share it publicly? This is of concern for me as well.
https://github.com/guillermo-navas-palencia/optbinning/issues/314#issuecomment-2092786722
Hi,
I'm using optbinning.BinningProcess() for automatic binning of around 100 features, and have noticed a difference in the bins obtained for some variables on each run. It's not for all the bins, but it's still large enough to be a concern. There is randomness in the binning, even when the dataset is the same. (I initially thought the issue could be with the dataset, but when I ran the same cell in my Jupyter file twice, I got different bins for the features.)
Is this something to be expected? Can we prevent this from happening and make sure we get the same, consistent bins each time?
Would really appreciate your help with this.
Thanks!!
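One diagnostic worth trying, offered as a hypothesis rather than something established in this thread: if the underlying solver stops at its time limit with a feasible but not provably optimal solution, the chosen bins could plausibly differ between runs. A sketch for checking the per-variable solver status after fitting, and for raising the time limit on a specific variable:

```python
# Sketch (hypothesis, not confirmed in the thread): list variables whose
# underlying OptimalBinning solve did not finish with status "OPTIMAL",
# then retry with a larger per-variable solver time limit (in seconds)
# passed through binning_fit_params.
from optbinning import BinningProcess

non_optimal = [name for name in variable_names
               if binning_process.get_binned_variable(name).status != "OPTIMAL"]
print("variables not solved to proven optimality:", non_optimal)

binning_fit_params_per_var = {"var_14": {"time_limit": 600}}
binning_process_2 = BinningProcess(variable_names=variable_names,
                                   binning_fit_params=binning_fit_params_per_var)
binning_process_2.fit(X_train, y_train, w_train)
```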