koalaverse / homlr-2ed

Code and resources for the 2nd edition of "Hands-on Machine Learning with R: An applied book covering the fundamentals of machine learning with R" by Boehmke & Greenwell
https://koalaverse.github.io/homlr-2ed/
MIT License

Improved discussion on leakage #1

Open bgreenwell opened 1 year ago

bgreenwell commented 1 year ago

Great resource: https://reproducible.cs.princeton.edu/

It ties in closely with leakage from preprocessing before data splitting and with how class imbalance is handled!

bradleyboehmke commented 1 year ago

In the first edition we talked about leakage in the feature engineering section: https://bradleyboehmke.github.io/HOML/engineering.html#data-leakage

Are you thinking of keeping it in that same location, or moving and expanding it elsewhere?

bgreenwell commented 1 year ago

Definitely expanding, and maybe referencing the idea of “model info sheets”: https://reproducible.cs.princeton.edu/. I also have a few simulations illustrating the impact of leakage in the preprocessing stage (for example, feature selection or up-sampling done before cross-validation).
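
For reference, here is a minimal sketch of the feature-selection case. It is not one of the simulations mentioned above, which are not shown in this thread. It uses plain Python with only the standard library, and every name and setting in it (`top_features`, `centroid_predict`, `cv_accuracy`, `simulate`, the nearest-centroid classifier, 50 rows, 500 noise features, 10 selected features, 5 folds) is an assumption chosen for illustration. The labels are unrelated to the features, so an honest accuracy estimate should sit near 0.5.

```python
import random


def top_features(X, y, rows, k):
    """Indices of the k features with the largest absolute gap in class means,
    computed from the given rows only."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def gap(j):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def centroid_predict(X, y, train, x_new, feats):
    """Nearest-centroid prediction for one row, restricted to the columns in feats."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        dist = 0.0
        for j in feats:
            center = sum(X[i][j] for i in members) / len(members)
            dist += (x_new[j] - center) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, k_feats, n_folds, leaky, rng):
    """Cross-validated accuracy. If leaky is True, features are selected once on
    ALL rows before splitting; otherwise they are re-selected inside each
    training fold."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    all_feats = top_features(X, y, idx, k_feats) if leaky else None
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        feats = all_feats if leaky else top_features(X, y, train, k_feats)
        correct += sum(
            centroid_predict(X, y, train, X[i], feats) == y[i] for i in test
        )
    return correct / len(X)


def simulate(n=50, p=500, k_feats=10, n_folds=5, reps=10, seed=1):
    """Average leaky and honest CV accuracy over several pure-noise data sets."""
    rng = random.Random(seed)
    leaky_acc, honest_acc = [], []
    for _ in range(reps):
        # Pure noise: the balanced labels carry no information about X.
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [i % 2 for i in range(n)]
        leaky_acc.append(cv_accuracy(X, y, k_feats, n_folds, True, rng))
        honest_acc.append(cv_accuracy(X, y, k_feats, n_folds, False, rng))
    return sum(leaky_acc) / reps, sum(honest_acc) / reps


if __name__ == "__main__":
    leaky, honest = simulate()
    print(f"selection before CV: {leaky:.2f}")
    print(f"selection inside CV: {honest:.2f}")
```

Selecting features on the full data lets the held-out rows influence which columns are kept, so the first estimate should come out well above chance even though there is no signal. Repeating the selection inside each training fold should bring it back to roughly 0.5.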

bgreenwell commented 1 year ago

Illegitimate features are another source of leakage we don’t really discuss, but they seem to be a common issue.