Closed. CarloLepelaars closed this issue 1 year ago.
Part of me feels a bit uneasy about having a train set that's in the future of the test set. Isn't that theoretically constantly leaking a bit of info that you don't have available at prediction time?
I fully agree, the class should ensure no train slices are in the future of the test slice. Is there some bug in the example code that leads you to believe some train sets are in the future of the test sets using this strategy?
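A quick way to test for this kind of leakage is to check every fold a splitter produces. The helper below is only a sketch: `train_precedes_test` is a hypothetical name (not part of scikit-lego or scikit-learn), and it assumes the rows are sorted chronologically, so that a larger index means a later point in time.

```python
from typing import Iterable, Sequence, Tuple


def train_precedes_test(
    splits: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> bool:
    """Return True if, in every fold, all train rows come before all test rows.

    Assumption: rows are sorted by time, so index order equals time order.
    """
    return all(max(train) < min(test) for train, test in splits)


# Hand-written folds for illustration only (not output of any real splitter).
ok_folds = [([0, 1, 2], [4, 5]), ([0, 1, 2, 3, 4], [6, 7])]
leaky_folds = [([0, 1, 6, 7], [3, 4])]  # rows 6 and 7 lie after the test rows

print(train_precedes_test(ok_folds))
print(train_precedes_test(leaky_folds))
```

The same check can be run on the `(train, test)` index pairs yielded by any CV splitter, as long as the data is in chronological order.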
5 folds in PurgedGroupTimeSeriesSplit would be structured like this (see the 5 top bars):
Image source: https://www.kaggle.com/code/marketneutral/purged-time-series-cv-xgboost-optuna?scriptVersionId=49427817
This snapshot in the video made me think otherwise, but I barely watched the video.
Just to check, isn't what you're suggesting very similar to this then?
Interesting, I wasn't aware of TimeGapSplit. That is indeed very similar, probably even exactly the same.
Knowing this, I think we can close the issue, because we already have TimeGapSplit. If implementing the combinatorial purged CV strategy has merit, we can open a new issue for that.
Inspired by the implementation of GroupTimeSeriesSplit (#537), I would like to propose adding PurgedGroupTimeSeriesSplit to the cross-validation (CV) strategies in scikit-lego. PurgedGroupTimeSeriesSplit allows for gaps in CV groups and has a couple of benefits:
Options for implementing this in scikit-lego:
GroupTimeSeriesSplit (#540) class.
PurgedGroupTimeSeriesSplit inherit from GroupTimeSeriesSplit
_BaseKFold.
At the moment I'm not sure what the best way to go is for implementation.
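To make the proposed behaviour concrete, here is a minimal pure-Python sketch of a purged, group-aware time series split. It is not the Kaggle implementation linked below and not an existing scikit-lego API: the function name and the `group_gap` parameter are assumptions for illustration. It also assumes that `groups` is already in chronological order and that the training window expands from the start of the data.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def purged_group_time_series_split(
    groups: Sequence[Hashable],
    n_splits: int = 5,
    group_gap: int = 1,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs with whole groups purged in between.

    Assumption: `groups` is in chronological order (e.g. one label per day).
    """
    # Unique groups in order of first appearance, i.e. chronological order.
    ordered = list(dict.fromkeys(groups))
    position = {g: i for i, g in enumerate(ordered)}
    n_groups = len(ordered)
    n_folds = n_splits + 1
    if n_folds > n_groups:
        raise ValueError("Need at least n_splits + 1 distinct groups.")
    test_size = n_groups // n_folds

    first_test_start = n_groups - n_splits * test_size
    for test_start in range(first_test_start, n_groups, test_size):
        # Train on everything strictly before the purged gap.
        train_end = test_start - group_gap
        if train_end <= 0:
            raise ValueError("group_gap leaves no training groups.")
        test_end = test_start + test_size
        train_idx = [i for i, g in enumerate(groups) if position[g] < train_end]
        test_idx = [
            i for i, g in enumerate(groups)
            if test_start <= position[g] < test_end
        ]
        yield train_idx, test_idx


# 12 "days" with 2 rows each; 3 folds, 1 whole day purged before each test set.
groups = [day for day in range(12) for _ in range(2)]
for train_idx, test_idx in purged_group_time_series_split(groups, 3, 1):
    train_days = sorted({groups[i] for i in train_idx})
    test_days = sorted({groups[i] for i in test_idx})
    print("train days:", train_days, "| test days:", test_days)
```

Because groups are purged as a whole, no group is ever split across the train and test sets, and the gap is measured in groups rather than rows.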
Example implementation of PurgedGroupTimeSeriesSplit by Yirun Zhang. Code snippet source: https://www.kaggle.com/code/gogo827jz/jane-street-supervised-autoencoder-mlp
Video explanation of purged cross validation: https://www.youtube.com/watch?v=hDQssGntmFA
P.S. There is also a "combinatorial" version of this purged cross-validation strategy (CPCV) that we could consider in a subsequent feature request, but for that it is important to first build the plain PurgedGroupTimeSeriesSplit. Article on CPCV: https://towardsai.net/p/l/the-combinatorial-purged-cross-validation-method