Open bgreenwell opened 1 year ago
In ed. 1 we talked about leakage in the feature engineering section: https://bradleyboehmke.github.io/HOML/engineering.html#data-leakage
Are you thinking of keeping it in that same location, or moving/expanding it elsewhere?
Definitely expanding, and maybe referencing the idea of “model info sheets”: https://reproducible.cs.princeton.edu/. I also have a few simulations illustrating the impact of leakage in the preprocessing stage (like feature selection followed by cross-validation, and upsampling).
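For illustration only (this is not one of the simulations mentioned above), here is a minimal Python sketch of the feature-selection case. It assumes pure-noise features, a simple nearest-centroid classifier, and a mean-difference filter; all function names and parameter values are hypothetical. Since the labels are independent of the features, any accuracy well above 50% reflects leakage.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def make_noise_data(n, p, rng):
    """Pure-noise features with balanced labels that are independent of X."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

def top_k_features(X, y, rows, k):
    """Rank features by absolute difference in class means, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        m1 = mean(X[i][j] for i in rows if y[i] == 1)
        m0 = mean(X[i][j] for i in rows if y[i] == 0)
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def centroid_predict(X, y, train, test, feats):
    """Nearest-centroid classifier fit on `train`, applied to `test`."""
    cents = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cents[c] = [mean(X[i][j] for i in members) for j in feats]
    preds = []
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in (0, 1)}
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds

def cv_accuracy(X, y, k_feats, n_folds, leaky, rng):
    """k-fold CV accuracy; if `leaky`, features are selected once on ALL rows."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    feats_all = top_k_features(X, y, idx, k_feats) if leaky else None
    correct = 0
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        # Honest version: redo the selection inside each training fold.
        feats = feats_all if leaky else top_k_features(X, y, train, k_feats)
        preds = centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(X)

def simulate(reps=10, n=50, p=500, k=10, n_folds=5, seed=0):
    """Average CV accuracy over `reps` noise data sets: (leaky, honest)."""
    rng = random.Random(seed)
    leaky, honest = [], []
    for _ in range(reps):
        X, y = make_noise_data(n, p, rng)
        leaky.append(cv_accuracy(X, y, k, n_folds, True, rng))
        honest.append(cv_accuracy(X, y, k, n_folds, False, rng))
    return mean(leaky), mean(honest)

if __name__ == "__main__":
    leaky_acc, honest_acc = simulate()
    print(f"selection before CV: {leaky_acc:.2f}")
    print(f"selection inside CV: {honest_acc:.2f}")
```

The only difference between the two runs is where the filter is fit: on all rows before splitting, or on the training fold only. The first should report accuracy far above chance on data that contains no signal at all.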
Illegitimate features are another issue we don’t really discuss, but they seem to be a common problem.
Great resource: https://reproducible.cs.princeton.edu/
Really ties into preprocessing before data splitting and dealing with class imbalance!
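A rough Python sketch of the class-imbalance case, assuming simple random upsampling of the minority class (labelled 1 here) and tracking rows by ID only. The helper names and the 5% minority rate are made up for illustration. It measures how many test rows also have a copy in the training set.

```python
import random

def upsample_minority(ids, labels, rng):
    """Duplicate minority rows (sampled with replacement) until classes balance."""
    minority = [i for i, c in zip(ids, labels) if c == 1]
    majority = [i for i, c in zip(ids, labels) if c == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def split(rows, test_frac, rng):
    """Random train/test split; returns (train, test)."""
    rows = rows[:]
    rng.shuffle(rows)
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]

def leaked_fraction(train, test):
    """Share of test rows whose original row ID also appears in train."""
    seen = set(train)
    return sum(r in seen for r in test) / len(test)

def compare(n=1000, minority_rate=0.05, test_frac=0.2, seed=1):
    rng = random.Random(seed)
    ids = list(range(n))
    labels = [1 if i < int(n * minority_rate) else 0 for i in ids]

    # Leaky order: upsample first, then split the inflated data.
    train, test = split(upsample_minority(ids, labels, rng), test_frac, rng)
    leaky = leaked_fraction(train, test)

    # Safer order: split first, then upsample the training rows only.
    train_ids, test_ids = split(ids, test_frac, rng)
    train_labels = [labels[i] for i in train_ids]
    train_up = upsample_minority(train_ids, train_labels, rng)
    safe = leaked_fraction(train_up, test_ids)
    return leaky, safe

if __name__ == "__main__":
    leaky, safe = compare()
    print(f"upsample then split: {leaky:.2f} of test rows also in train")
    print(f"split then upsample: {safe:.2f} of test rows also in train")
```

When upsampling happens before the split, duplicated minority rows end up on both sides, so the test set partly measures memorization. Splitting first keeps the test rows disjoint from training by construction.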