Open priyankamishra31 opened 7 months ago
Hi @guillermo-navas-palencia, I'm using the 'cart' methods (as you suggested in the comment). I thought the subsample default-value issue only affected sklearn.preprocessing.KBinsDiscretizer. Does it affect the cart methods too?
I specified cart in the binning_fit_params parameter of BinningProcess().
Thanks :-)
Hi @priyankamishra31. I was able to replicate this behaviour, thanks for providing the dataset. Findings:
This nondeterministic behaviour only occurs with the Google OR-Tools CP-SAT solver. It seems to be a bug:
I found this error disappears when using 'mip' as the solver, so it seems to be a solver issue (well, not necessarily, read below).
However, in terms of IV, CP-SAT returns the same value, i.e., the difference is below the solver's tolerance of 1e-6, so in that sense both solutions are equally valid. In other words, there are multiple optimal solutions. I am using ortools version 9.9.3963 (the latest version).
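To make "equally valid" concrete: two runs can pick different split points yet reach the same objective value (IV) up to the solver tolerance. A minimal sketch of that comparison, using illustrative IV values (not real solver output):

```python
# Treat two solutions as equally optimal when their IVs agree within the
# solver tolerance. The 1e-6 value is the CP-SAT tolerance mentioned above;
# the IV numbers below are made up for illustration.
SOLVER_TOL = 1e-6

def same_optimum(iv_a, iv_b, tol=SOLVER_TOL):
    """Two solutions are equally optimal if their objectives agree within tol."""
    return abs(iv_a - iv_b) <= tol

iv_run_1 = 0.4123451  # hypothetical IV from run 1
iv_run_2 = 0.4123455  # hypothetical IV from run 2: different bins, same optimum

print(same_optimum(iv_run_1, iv_run_2))  # True: both solutions are optimal
```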
I understand that from a modelling perspective, this is an issue. I will fix the random_seed parameter to enforce reproducibility. Lastly, it is worth noting that the Google OR-Tools team does not guarantee the same solution across versions.
Thanks @guillermo-navas-palencia , I really appreciate you looking into this.
Is there anything I could do from my side (or a workaround) if I still want to use the 'cp' solver and get a consistent result? This would help me until the next version of this package.
Thanks again :-)
Unfortunately, I don't think so. If your target is binary and you increase min_prebin_size a bit, the MIP solver should be only slightly slower than CP. In general, keeping a reasonable min_prebin_size (e.g., 0.025-0.05) will reduce the number of equally optimal solutions. If I find the time, I will also experiment with the other MIP solvers already supported by ortools (HiGHS and SCIP).
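The suggested workaround (switch to the 'mip' solver and raise min_prebin_size) can be wired through BinningProcess. A minimal sketch; the per-variable keying of binning_fit_params follows my reading of the BinningProcess docs, and the variable names are just illustrative:

```python
# Sketch of the workaround: force the deterministic 'mip' solver and a larger
# min_prebin_size for every variable via binning_fit_params.
# Assumptions: binning_fit_params maps variable name -> OptimalBinning fit
# options; "var_14"/"var_15" are placeholder names from the example dataset.
variable_names = ["var_14", "var_15"]

binning_fit_params = {
    name: {"solver": "mip", "min_prebin_size": 0.025}
    for name in variable_names
}

# Then (not executed here):
# binning_process = BinningProcess(variable_names=variable_names,
#                                  binning_fit_params=binning_fit_params)
# binning_process.fit(X_train, y_train)
print(binning_fit_params["var_14"])
```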
Another comment about the CP solver: please bear in mind that the CP solver works with integer values, so optbinning rounds to integer after scaling (x 1e6), which incurs rounding errors if the x values are tiny. For reference: https://github.com/guillermo-navas-palencia/optbinning/blob/master/optbinning/binning/cp.py#L53
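The scaling-and-rounding step described above is easy to demonstrate: values are multiplied by 1e6 and rounded to integers, so anything much smaller than 1e-6 loses precision or collapses to zero. A toy illustration (the x values are made up, not from the linked code):

```python
# Illustration of the integer scaling used for the CP solver: scale by 1e6,
# then round to the nearest integer. Tiny x values are lost to rounding.
SCALE = int(1e6)

x_normal = 0.734512   # ordinary magnitude: survives scaling
x_tiny = 3.2e-9       # far below 1e-6: rounds away entirely

scaled_normal = round(x_normal * SCALE)  # 734512
scaled_tiny = round(x_tiny * SCALE)      # 0 -> information lost

print(scaled_normal, scaled_tiny)
```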
This issue is linked to: https://github.com/guillermo-navas-palencia/optbinning/issues/299 (Sorry, I didn't find the option to reopen the issue, probably because I'm not a collaborator.)
Hi @guillermo-navas-palencia ,
I'm using optbinning.BinningProcess() for automatic binning of around 100-200 features, and have noticed a difference in the bins obtained for some variables on each run. It's not all the bins, but it's still frequent enough to be a concern. There is randomness in the binning, even when the dataset is the same. (I initially thought the issue could be with the dataset, but when I ran the same cell in my Jupyter notebook twice, I got different bins for the features.)
The dataset used was from Kaggle, linked here: https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data?select=train.csv
I tried to replicate the issue and got a reproducible example. (I'm sharing the code file and the CSV of the exported results by email, since I don't see an option to attach them here.)
Binning Process:
binning_process = BinningProcess(
    variable_names=variable_names,
    categorical_variables=categorical_variables,
    min_prebin_size=0.01,
    **binning_fit_params[0],
)
binning_process.fit(X_train, y_train, w_train)
And these are the binning results when running the binning process 3 times, without changing anything:
For example, if you compare the files _binningresult.csv and _binning_result2.csv, you'll see the difference in bins for var_14 and var_15.
Similarly, on comparing the 3 files, I got the following differences:
I'm using optbinning==0.18.0
Can we prevent this from happening and ensure we get the same consistent bins each time?
I hope this helps. I'm also sharing the Jupyter notebook (with output cells) by email for more context. Thanks for your help with this.
Thanks!