Open tdoublep opened 4 years ago
Hi @tdoublep,
Thanks for raising this. When we go through bug week, we will have a look at this.
Best regards, Jacques
@tdoublep did you try including pct_embargo = 0.01?
From my understanding, sometimes the code might run into such problems. Try including pct_embargo = 0.01 instead of 0.
Conceptually that is what pct_embargo is for. Try and see if it works?
From Page 107 of "Advances in Financial Machine Learning"
For those cases where purging is not able to prevent all leakage, we can impose an embargo on training observations after every test set. The embargo does not need to affect training observations prior to a test set,
The above example clearly shows a training example prior to the test set overlapping with the beginning of the test set. As stated in the OP, this is the opposite problem to that solved by embargo.
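To make the point about embargo concrete, here is a minimal sketch in plain pandas. It does not use mlfinlab; `embargoed_positions` is a hypothetical helper, and it assumes the embargo size is the integer part of `n_samples * pct_embargo`. The toy events mirror the dates in this thread (ten daily events, each lasting two days).

```python
import pandas as pd

# Toy events: index = event start, value = event end (two days later).
starts = pd.date_range("2020-01-01", periods=10, freq="D")
events = pd.Series(starts + pd.Timedelta(days=2), index=starts)


def embargoed_positions(n_samples, test_end_pos, pct_embargo):
    """Positions removed by an embargo (hypothetical helper).

    Assumption: the embargo drops the h = int(n_samples * pct_embargo)
    samples immediately AFTER the test block, and nothing before it.
    """
    h = int(n_samples * pct_embargo)
    return list(range(test_end_pos, min(test_end_pos + h, n_samples)))


# Test block = the last five events (positions 5..9): nothing follows it,
# and with ten samples pct_embargo = 0.01 rounds down to h = 0 anyway.
dropped_last_fold = embargoed_positions(len(events), test_end_pos=10, pct_embargo=0.01)

# Test block = the first five events (positions 0..4) with a larger embargo:
# only positions AFTER the test block are affected.
dropped_first_fold = embargoed_positions(len(events), test_end_pos=5, pct_embargo=0.2)
```

Under these assumptions the embargo never touches a training event that precedes the test set, so it cannot remove the overlap described in the OP.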
@tdoublep kindly confirm the below?
------ 0-th fold ------
>> Training events:
2020-01-08 2020-01-10
2020-01-09 2020-01-11
dtype: datetime64[ns]
>> Test events:
2020-01-01 2020-01-03
2020-01-02 2020-01-04
2020-01-03 2020-01-05
2020-01-04 2020-01-06
2020-01-05 2020-01-07
dtype: datetime64[ns]
------ 1-th fold ------
>> Training events:
2020-01-01 2020-01-03
2020-01-02 2020-01-04
2020-01-03 2020-01-05
2020-01-04 2020-01-06
dtype: datetime64[ns]
>> Test events:
2020-01-06 2020-01-08
2020-01-07 2020-01-09
2020-01-08 2020-01-10
2020-01-09 2020-01-11
2020-01-10 2020-01-12
dtype: datetime64[ns]
Yes, this looks like what I would expect given the definition of PurgedKFold
Hi,
I ran into the same issue recently and agree with @tdoublep that this is an error. I used the same fix that he suggested.
However, I do not get the same output as @boyboi86. In the second fold (i.e. 1-th), the bold training sample should have been purged, shouldn't it? Indeed, it's an example where we have
$t_{j,0} \le t_{i,1} \le t_{j,1}$
------ 1-th fold ------
>> Training events:
2020-01-01 2020-01-03
2020-01-02 2020-01-04
2020-01-03 2020-01-05
**2020-01-04 2020-01-06**
dtype: datetime64[ns]
>> Test events:
2020-01-06 2020-01-08
2020-01-07 2020-01-09
2020-01-08 2020-01-10
2020-01-09 2020-01-11
2020-01-10 2020-01-12
dtype: datetime64[ns]
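The overlap condition can be checked directly on the fold printed above. This is only a sketch in plain pandas: the training events are copied from the output, and the test window is assumed to run from the start of the first test event (2020-01-06) to the end of the last one (2020-01-12).

```python
import pandas as pd

# Training events of the 1-th fold: index = start, value = end.
train = pd.Series(
    pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06"]),
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
)

test_start = pd.Timestamp("2020-01-06")  # t_{j,0}: start of the first test event
test_end = pd.Timestamp("2020-01-12")    # t_{j,1}: end of the last test event

# Training events whose end time falls inside the test window,
# i.e. t_{j,0} <= t_{i,1} <= t_{j,1}.
overlapping = train[(train >= test_start) & (train <= test_end)]
```

Only the event starting on 2020-01-04 satisfies the condition, which is the sample highlighted in bold.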
Describe the bug
The PurgedKFold class creates folds in which events in the training set can overlap with events in the test set. In particular, such training events end during the same timeframe in which test events are happening (i.e., the opposite of the problem solved by embargo).
To Reproduce
Produces the following:
The overlapping events above are highlighted in bold.
Possible cause/solution
I believe this problem is related to the definition of test_times here: https://github.com/hudson-and-thames/mlfinlab/blob/master/mlfinlab/cross_validation/cross_validation.py#L91
In particular, the test window is set to start from the end of the first event in the test set and run to the end of the last event in the test set. It could be fixed by defining the test window to start from the start of the first event in the test set and run to the end of the last event in the test set.
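The difference between the two window definitions can be illustrated with a small, self-contained sketch. This is not the mlfinlab implementation: `purge_train` is a hypothetical helper written in plain pandas, and the toy events (ten daily events, each lasting two days) are an assumption chosen to match the dates used in this thread.

```python
import pandas as pd

# Toy events: index = event start, value = event end (two days later).
starts = pd.date_range("2020-01-01", periods=10, freq="D")
events = pd.Series(starts + pd.Timedelta(days=2), index=starts)


def purge_train(events, test_pos, window_start):
    """Drop training events overlapping [window_start, window_end].

    Hypothetical helper: window_end is the latest end time among the
    test events; window_start is supplied by the caller.
    """
    test = events.iloc[test_pos]
    window_end = test.max()
    train = events.drop(test.index)
    # An event overlaps the window unless it ends before the window
    # starts or starts after the window ends.
    overlaps = (train >= window_start) & (train.index <= window_end)
    return train[~overlaps]


test_pos = list(range(5, 10))  # last five events form the test set
test = events.iloc[test_pos]

# Window starting at the END of the first test event (as described above).
loose = purge_train(events, test_pos, window_start=test.iloc[0])

# Window starting at the START of the first test event (the proposed fix).
strict = purge_train(events, test_pos, window_start=test.index[0])
```

In this sketch `loose` keeps all five earlier events, including those ending on 2020-01-06 and 2020-01-07 while test events are already in progress, whereas `strict` keeps only the three events that end before 2020-01-06.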
I'm relatively new to the codebase, so maybe I misunderstand something?
Package versions