Hi @germayneng ,
I'm not @kaz-Anova but I'll try to answer your question ;-) he can correct me if needed! There was a lot of discussion on target/likelihood encoding and stacking during the Porto Seguro competition on Kaggle. In particular, CPMP made the clear point that not using the same CV folds at each stage of your process may introduce leakage. I may need some time to find the thread...
Likelihood encoding is an estimator by itself, so features produced this way can be seen as first-level stacking. In other words, if you use likelihood encoding, you have to perform it inside your CV loop to avoid leakage.
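To make that concrete, here is a minimal sketch of out-of-fold likelihood (target) encoding. It assumes a pandas DataFrame `train` with a categorical column `"cat"` and a binary target `"target"` (these names are just for illustration); each fold's validation rows are encoded using only the statistics of that fold's training rows, so a row never sees its own target:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_likelihood_encode(train, col, target, n_splits=5, seed=42):
    prior = train[target].mean()                 # global prior used for unseen categories
    encoded = pd.Series(np.nan, index=train.index)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in folds.split(train):
        trn, val = train.iloc[trn_idx], train.iloc[val_idx]
        # per-category target mean computed on the training part of the fold only
        means = trn.groupby(col)[target].mean()
        encoded.iloc[val_idx] = val[col].map(means).fillna(prior).values
    return encoded

# train["cat_enc"] = oof_likelihood_encode(train, "cat", "target")
```

The important point is the fixed `seed`: whatever folds you use here are the folds the rest of your stack has to respect.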
It's not what I've done here : https://www.kaggle.com/ogrellier/xgb-classifier-upsampling-lb-0-283?scriptVersionId=1638269
but it's what Andy Harless did here : https://www.kaggle.com/aharless/xgboost-cv-lb-284
Now StackNet uses a different random number generator than, say, Python, so if you stack with it after performing likelihood encoding in Python with a given CV fold seed, you need to tell StackNet what those folds are.
The only way to do this is to provide StackNet with the folds directly using data_prefix, and I think that is why @kaz-Anova encourages users to use data_prefix and avoid any sort of leakage.
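As an illustration of what that could look like on the Python side, the sketch below writes out each fold's train/validation split using the same seed as the encoding step. The file names and CSV layout here are assumptions for illustration only; check the StackNet README for the exact naming convention and data format that data_prefix expects:

```python
from sklearn.model_selection import KFold

def export_folds(train, target, prefix="mydata", n_splits=5, seed=42):
    # Same n_splits/shuffle/seed as the encoding step, so the folds match exactly.
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Assumption: target column first, no header/index (verify against the README).
    cols = [target] + [c for c in train.columns if c != target]
    for fold, (trn_idx, val_idx) in enumerate(folds.split(train)):
        train.iloc[trn_idx][cols].to_csv(f"{prefix}_train{fold}.csv", index=False, header=False)
        train.iloc[val_idx][cols].to_csv(f"{prefix}_cv{fold}.csv", index=False, header=False)

# export_folds(train, "target")
```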
Hope this helps and let me know if what I wrote is not clear. Cheers, Olivier
Hi kazanova,
Thanks for the great work! I have one question:
When we apply likelihood encoding, why do we have this:
I am confused at this point. If likelihood encoding is applied in a nested CV fold (for example, with 10-fold CV we have an inner 5-fold CV), can we simply stack them instead of using data_prefix to define the folds manually? Is there a case of direct leakage?