thuml / Autoformer

About: Code release for "Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting" (NeurIPS 2021), https://arxiv.org/abs/2106.13008
MIT License

Potential data leakage issue in data_loader.py Dataset_Custom class #196

Closed ceresshadows closed 1 year ago

ceresshadows commented 1 year ago

Issue Description

In the current implementation of dataset partitioning in the Dataset_Custom class, there appears to be a risk of data leakage among the training, validation, and test sets. The cause is that the partition boundaries (border1s and border2s) do not account for the sequence length (seq_len) before the dataset is partitioned.

Please correct me if I am wrong : )

Suggested Modification

To prevent data leakage and ensure that each partition works with distinct, non-overlapping data points, seq_len could be subtracted from the total data length before the partitions are defined. Here is a suggested modification:


total_length = len(df_raw) - self.seq_len  # Subtract seq_len to avoid boundary issues
num_train = int(total_length * 0.7)
num_test = int(total_length * 0.2)
num_vali = total_length - num_train - num_test

border1s = [0, num_train, num_train + num_vali]  # Update borders to avoid overlap
border2s = [num_train, num_train + num_vali, total_length]

border1 = border1s[self.set_type]
border2 = border2s[self.set_type]
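
The suggested borders can be checked in isolation. The sketch below is a standalone illustration, not code from the repository: the helper names `proposed_borders` and `partitions_are_disjoint` and the example sizes (1000 rows, seq_len of 96) are hypothetical. It reproduces the arithmetic of the suggestion and confirms that the three index ranges do not overlap.

```python
def proposed_borders(n_rows, seq_len, train_frac=0.7, test_frac=0.2):
    """Compute (border1s, border2s) using the arithmetic of the suggestion.

    n_rows stands in for len(df_raw); all names here are illustrative.
    The lists are ordered [train, vali, test], as in the suggestion.
    """
    total_length = n_rows - seq_len
    num_train = int(total_length * train_frac)
    num_test = int(total_length * test_frac)
    num_vali = total_length - num_train - num_test

    border1s = [0, num_train, num_train + num_vali]
    border2s = [num_train, num_train + num_vali, total_length]
    return border1s, border2s


def partitions_are_disjoint(border1s, border2s):
    """Return True if no half-open range [b1, b2) overlaps another."""
    spans = sorted(zip(border1s, border2s))
    return all(
        prev_end <= next_start
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    )


b1, b2 = proposed_borders(n_rows=1000, seq_len=96)
print(b1, b2)                            # start and end index of each partition
print(partitions_are_disjoint(b1, b2))   # whether the ranges overlap
```

One side effect worth noting: because the last border is total_length rather than len(df_raw), the final seq_len rows of the data fall outside every partition under this suggestion.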

wuhaixu2016 commented 1 year ago

Hi, we have added some constraints on the dataset length. Please see https://github.com/thuml/Autoformer/blob/main/data_provider/data_loader.py#L285. This design avoids the data leakage.
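
For readers who do not follow the link, a length constraint of this kind can be sketched as follows. The linked line is not reproduced here. The sketch assumes the constraint is the usual sliding-window count, that is, the number of samples is the partition length minus seq_len minus pred_len plus one. The names `num_windows` and `window_bounds` are hypothetical and pred_len is an assumed forecast-horizon parameter.

```python
def num_windows(partition_len, seq_len, pred_len):
    """Count sliding windows that fit entirely inside one partition.

    Assumed form of the length constraint; not copied from the repository.
    """
    return max(partition_len - seq_len - pred_len + 1, 0)


def window_bounds(index, seq_len, pred_len):
    """Return (input_start, input_end, target_end) for sample `index`.

    The input covers [input_start, input_end) and the target covers
    [input_end, target_end), both relative to the start of the partition.
    """
    input_start = index
    input_end = input_start + seq_len
    target_end = input_end + pred_len
    return input_start, input_end, target_end


partition_len, seq_len, pred_len = 100, 24, 12
n = num_windows(partition_len, seq_len, pred_len)

# The last valid sample must end no later than the end of the partition.
last_start, last_input_end, last_target_end = window_bounds(n - 1, seq_len, pred_len)
assert last_target_end <= partition_len
```

Under this assumption, no sample drawn from a partition reads past that partition's last row, so targets from one split cannot come from rows belonging to the next.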