Potential data leakage issue in data_loader.py Dataset_Custom class

Issue Description

In the current dataset partitioning in the Dataset_Custom class, there appears to be a risk of data leakage among the training, validation, and test sets. The partition boundaries (border1s and border2s) do not seem to account for the sequence length (seq_len) before the dataset is split, so windows from adjacent partitions may overlap.

Please correct me if I'm wrong :)

Suggested Modification

To prevent data leakage and ensure that each partition works with distinct, non-overlapping data points, seq_len could be subtracted from the total data length before the partitions are defined. Here is a suggested modification:


total_length = len(df_raw) - self.seq_len  # Subtract seq_len to avoid boundary issues
num_train = int(total_length * 0.7)
num_test = int(total_length * 0.2)
num_vali = total_length - num_train - num_test

border1s = [0, num_train, num_train + num_vali]  # Update borders to avoid overlap
border2s = [num_train, num_train + num_vali, total_length]

border1 = border1s[self.set_type]
border2 = border2s[self.set_type]
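
To make the suggested borders easier to check, here is a minimal standalone sketch that computes them for a given row count and verifies that the three ranges do not overlap. The helper names `proposed_borders` and `partitions_overlap`, and the example values `n_rows=1000` and `seq_len=96`, are hypothetical and not part of data_loader.py; the 0.7 / 0.2 ratios are taken from the snippet above.

```python
def proposed_borders(n_rows, seq_len, train_ratio=0.7, test_ratio=0.2):
    """Compute [train, vali, test] borders as in the suggested modification.

    Hypothetical helper: mirrors the snippet above outside the Dataset class.
    """
    total_length = n_rows - seq_len  # reserve seq_len rows before splitting
    num_train = int(total_length * train_ratio)
    num_test = int(total_length * test_ratio)
    num_vali = total_length - num_train - num_test

    border1s = [0, num_train, num_train + num_vali]
    border2s = [num_train, num_train + num_vali, total_length]
    return border1s, border2s


def partitions_overlap(border1s, border2s):
    """Return True if any two half-open ranges [border1, border2) share a row."""
    ranges = sorted(zip(border1s, border2s))
    return any(
        prev_end > next_start
        for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:])
    )


# Example values are assumptions chosen only for illustration.
border1s, border2s = proposed_borders(n_rows=1000, seq_len=96)
print(border1s, border2s)  # [0, 632, 724] [632, 724, 904]
print(partitions_overlap(border1s, border2s))  # False
```

Note that under this scheme the last seq_len rows fall outside every partition, which may or may not be acceptable depending on how the windows are built from these borders.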

thuml / Autoformer

Potential data leakage issue in data_loader.py Dataset_Custom class #196
