kaz-Anova / StackNet

StackNet is a computational, scalable and analytical Meta modelling framework
MIT License

likelihood encoding and stacking #66

Closed germayneng closed 6 years ago

germayneng commented 6 years ago

Hi kazanova,

Thanks for the great work! I have one question:

When we apply likelihood encoding, why do we need this:

The second model makes use of the new data_prefix command which tells StackNet to expect different pairs of train and validation data to run stacking on. In other words the User supplies the data per fold. This schema is used because likelihood features and counts are created within cross-validation.

I am confused at this point. Suppose likelihood encoding is applied inside nested CV folds, for example an outer 10-fold CV with an inner 5-fold CV. Can we simply stack the results instead of using data_prefix to supply the folds manually? Would that cause direct leakage?

goldentom42 commented 6 years ago

Hi @germayneng ,

I'm not @kaz-Anova, but I'll try to answer your question ;-) subject to his correction! There was a lot of discussion on target/likelihood encoding and stacking during the Porto Seguro competition on Kaggle. In particular, CPMP made a clear point that not using the same CV folds at each stage of your process may introduce leakage. I may need some time to find the thread...

Likelihood encoding is an estimator by itself and therefore using these features can be seen as a 1st level stacking. In other words if you do use likelihood encoding, you have to perform this inside your CV loop to avoid leakage.
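To make "inside your CV loop" concrete, here is a minimal Python sketch of out-of-fold likelihood encoding. It is only an illustration, not the code from StackNet or from the kernels linked in this thread: the function names, the additive smoothing scheme and the `smoothing` value are all assumptions chosen for the example.

```python
import random
from collections import defaultdict

def make_folds(n_rows, n_folds, seed):
    """Return (train_idx, valid_idx) pairs built from one seeded shuffle."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = []
    for k in range(n_folds):
        valid = sorted(idx[k::n_folds])
        valid_set = set(valid)
        train = [i for i in range(n_rows) if i not in valid_set]
        folds.append((train, valid))
    return folds

def fit_likelihood(categories, targets, rows, prior, smoothing):
    """Smoothed mean target per category, computed from `rows` only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in rows:
        sums[categories[i]] += targets[i]
        counts[categories[i]] += 1
    # Hypothetical smoothing: shrink each category mean towards the prior.
    return {c: (sums[c] + prior * smoothing) / (counts[c] + smoothing)
            for c in counts}

def oof_likelihood_encode(categories, targets, folds, smoothing=10.0):
    """Encode every row using statistics from the other folds only."""
    encoded = [None] * len(categories)
    for train, valid in folds:
        prior = sum(targets[i] for i in train) / len(train)
        mapping = fit_likelihood(categories, targets, train, prior, smoothing)
        for i in valid:
            # Categories unseen in the training part fall back to the prior.
            encoded[i] = mapping.get(categories[i], prior)
    return encoded

categories = ["a", "b", "a", "c", "b", "a", "c", "a"]
targets = [1, 0, 1, 0, 0, 1, 1, 0]
folds = make_folds(len(categories), n_folds=4, seed=0)
encoded = oof_likelihood_encode(categories, targets, folds)
```

The encoded value of a row never depends on that row's own target, which is the property that breaks if a later stage uses different folds than the ones used here.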

It's not what I've done here: https://www.kaggle.com/ogrellier/xgb-classifier-upsampling-lb-0-283?scriptVersionId=1638269

but it's what Andy Harless did here: https://www.kaggle.com/aharless/xgboost-cv-lb-284

Now, StackNet uses a different random number generator than, say, Python. If you performed likelihood encoding in Python with a given CV fold seed and then do stacking with StackNet, the same seed will not reproduce the same folds, so you need to tell StackNet what these folds are.

The only way to do this is to provide StackNet with the folds directly using data_prefix and I think that is why @kaz-Anova encourages users to use data_prefix and avoid any sort of leakage.
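As a rough sketch of what "providing the folds directly" could look like on the Python side, the snippet below writes one train/validation file pair per fold. The helper name `write_fold_files`, the `[target, feature, ...]` row layout and the `<prefix>_train<k>.txt` / `<prefix>_cv<k>.txt` naming are assumptions made for illustration; please check the StackNet documentation for the exact files that data_prefix expects (it may also require files for the whole training set).

```python
import csv
import os

def write_fold_files(fold_data, prefix, out_dir):
    """Write one train/cv CSV pair per fold.

    fold_data is a list of (train_rows, cv_rows); each row is assumed to be
    [target, feature_1, ...] with features already encoded for that fold.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for k, (train_rows, cv_rows) in enumerate(fold_data):
        # Assumed naming pattern; verify against the StackNet docs.
        train_path = os.path.join(out_dir, f"{prefix}_train{k}.txt")
        cv_path = os.path.join(out_dir, f"{prefix}_cv{k}.txt")
        for path, rows in ((train_path, train_rows), (cv_path, cv_rows)):
            with open(path, "w", newline="") as fh:
                csv.writer(fh).writerows(rows)
        paths.append((train_path, cv_path))
    return paths
```

Because the files are produced by the same Python code that built the folds and the encoding, StackNet does not have to regenerate the split with its own random number generator.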

Hope this helps and let me know if what I wrote is not clear. Cheers, Olivier