PurgedKFold: training events can overlap with testing events

tdoublep commented 4 years ago

Describe the bug PurgedKFold class creates folds such that events in the training set can overlap with events in the test set. In particular, such training events end during same timeframe that test events are happening (i.e., the opposite of the problem solved by embargo).

To Reproduce

import datetime
import pandas as pd
import mlfinlab as ml

# start date
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")

# 10 events, starting times are offset by 1 day
start_dates = [start + datetime.timedelta(days=x) for x in range(10)]

# each event lasts for 2 days
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]

# create series (index=start date, values=end date)
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = ml.cross_validation.PurgedKFold(n_splits=2, samples_info_sets=df)

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])

Produces the following:

------ 0-th fold ------
>> Training events:
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]
>> Test events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
dtype: datetime64[ns]
------ 1-th fold ------
>> Training events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
dtype: datetime64[ns]
>> Test events:
2020-01-06   2020-01-08
2020-01-07   2020-01-09
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]

The overlapping events above are highlighted in bold.

Possible cause/solution

I believe this problem is related to the definition to test_times here: https://github.com/hudson-and-thames/mlfinlab/blob/master/mlfinlab/cross_validation/cross_validation.py#L91 In particular, the test window is set to start from the end of the first event in the test set, to the end of the last event in the test set:

test_times = pd.Series(index=[self.samples_info_sets[start_ix]], data=[self.samples_info_sets[end_ix-1]])

It could be fixed by defining the test window to start from the start for the first event in the test, to the end of the last event in the test set:

test_times = pd.Series(index=[self.samples_info_sets.index[start_ix]], data=[self.samples_info_sets[end_ix-1]])

I'm relatively new to the codebase, maybe I misuderstand something?

Package versions

$ pip freeze | grep 'mlfinlab\|pandas'
mlfinlab==0.9.3
pandas==0.25.3

Jackal08 commented 4 years ago

HI @tdoublep,

Thanks for raising this. When we go through bug week, we will have a look at this.

Best regards Jacques

boyboi86 commented 4 years ago

@tdoublep did you try and include pct_embargo = 0.01?

From my understanding, sometimes the code might run into such problems. Try including pct_embargo = 0.01 instead of 0.

Conceptually that is what pct_embargo is for. Try and see if it works?

tdoublep commented 4 years ago

From Page 107 of "Advances in Financial Machine Learning"

For those cases where purging is not able to prevent all leakage, we can impose an embargo on training observations after every test set. The embargo does not need to affect training observations prior to a test set,

The above example clearly shows a training example prior to the test set overlapping with the beginning of the test set. As stated in the OP, this is the opposite problem to that solved by embargo.

boyboi86 commented 4 years ago

@tdoublep kindly confirm the below?

------ 0-th fold ------
>> Training events:
2020-01-08   2020-01-10
2020-01-09   2020-01-11
dtype: datetime64[ns]
>> Test events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
dtype: datetime64[ns]
------ 1-th fold ------
>> Training events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
dtype: datetime64[ns]
>> Test events:
2020-01-06   2020-01-08
2020-01-07   2020-01-09
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]

tdoublep commented 4 years ago

Yes, this is looks like what I would expect given the definition of PurgedKFold

wissam124 commented 2 years ago

Hi,

I ran into the same issue recently and agree with @tdoublep that this is an error. I used the same fix that he suggested.

However, I do not get the same output as @boyboi86. In the second fold (i.e. 1-th), the bold training sample should have been purged, shouldn't it? Indeed, it's an example where we have

t(j, 0) <= t(i, 1) <= t_(j, 1)

------ 1-th fold ------
>> Training events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
**2020-01-04   2020-01-06**
dtype: datetime64[ns]
>> Test events:
2020-01-06   2020-01-08
2020-01-07   2020-01-09
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]

hudson-and-thames / mlfinlab

PurgedKFold: training events can overlap with testing events #295