Hi @germayneng ,
I'm not @kaz-Anova but I'll try to answer your question ;-) he can correct me if needed! There was a lot of discussion on target/likelihood encoding and stacking during the Porto Seguro competition on Kaggle. In particular, CPMP made the clear point that not using the same CV folds at each stage of your process may introduce leakage. I may need some time to find the thread...
Likelihood encoding is an estimator by itself, so features produced this way can be seen as first-level stacking. In other words, if you use likelihood encoding, you have to perform it inside your CV loop to avoid leakage.
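To make that concrete, here is a minimal sketch of out-of-fold likelihood (target) encoding. It assumes a pandas DataFrame `train` with a categorical column `"cat"` and a binary target `"target"` (these names are just for illustration); each fold's validation rows are encoded using only the statistics of that fold's training rows, so a row never sees its own target:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_likelihood_encode(train, col, target, n_splits=5, seed=42):
    prior = train[target].mean()                 # global prior used for unseen categories
    encoded = pd.Series(np.nan, index=train.index)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in folds.split(train):
        trn, val = train.iloc[trn_idx], train.iloc[val_idx]
        # per-category target mean computed on the training part of the fold only
        means = trn.groupby(col)[target].mean()
        encoded.iloc[val_idx] = val[col].map(means).fillna(prior).values
    return encoded

# train["cat_enc"] = oof_likelihood_encode(train, "cat", "target")
```

The important point is the fixed `seed`: whatever folds you use here are the folds the rest of your stack has to respect.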
It's not what I've done here : https://www.kaggle.com/ogrellier/xgb-classifier-upsampling-lb-0-283?scriptVersionId=1638269
but it's what Andy Harless did here : https://www.kaggle.com/aharless/xgboost-cv-lb-284
Now StackNet uses a different random number generator than, say, Python, so if you stack with it after performing likelihood encoding in Python with a given CV fold seed, you need to tell StackNet what those folds are.
The only way to do this is to provide StackNet with the folds directly using data_prefix, and I think that is why @kaz-Anova encourages users to use data_prefix and avoid any sort of leakage.
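As an illustration of what that could look like on the Python side, the sketch below writes out each fold's train/validation split using the same seed as the encoding step. The file names and CSV layout here are assumptions for illustration only; check the StackNet README for the exact naming convention and data format that data_prefix expects:

```python
from sklearn.model_selection import KFold

def export_folds(train, target, prefix="mydata", n_splits=5, seed=42):
    # Same n_splits/shuffle/seed as the encoding step, so the folds match exactly.
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Assumption: target column first, no header/index (verify against the README).
    cols = [target] + [c for c in train.columns if c != target]
    for fold, (trn_idx, val_idx) in enumerate(folds.split(train)):
        train.iloc[trn_idx][cols].to_csv(f"{prefix}_train{fold}.csv", index=False, header=False)
        train.iloc[val_idx][cols].to_csv(f"{prefix}_cv{fold}.csv", index=False, header=False)

# export_folds(train, "target")
```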
Hope this helps and let me know if what I wrote is not clear. Cheers, Olivier
Hi kazanova,
Thanks for the great work! I have one question:
When we apply likelihood encoding, why do we have this:
I am confused at this point. If likelihood encoding is applied in a nested CV fold (for example, with 10-fold CV we have an inner 5-fold CV), can we simply stack them instead of using data_prefix to define the folds manually? Is there a case of direct leakage?