Data leakage when normalizing the train data and test data together?
howie1013 opened 2 years ago
Yep
The data leakage in this project is serious; I doubt how the paper could have passed academic peer review...
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Chronological 70/30 split: no shuffling for time-series data
train_size = round(len(dataset) * 0.7)
print(f'Training Data Size: {train_size}')
train_data = dataset[0:train_size]
test_data = dataset[train_size:]

X_train = pd.DataFrame(train_data)
X_test = pd.DataFrame(test_data)
y_train = pd.DataFrame(train_data['Close'])
y_test = pd.DataFrame(test_data['Close'])

# Fit & transform features: normalise with scalers fitted on the
# training split only, then reuse them on the test split
X_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train)
X_test = X_scaler.transform(X_test)   # transform only, never fit on test data
y_test = y_scaler.transform(y_test)
This snippet generates the normalised data without data leakage. After this correction, however, the scaler no longer works properly, which makes the model useless.
The prices in the testing period are far higher than the training prices, so a common method like z-score or MinMax scaling will not be useful (see the sketch below).
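A minimal sketch of that failure mode, using synthetic prices (the numbers are illustrative, not from the project's data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative prices: the test period trades well above the training range
train_prices = np.array([[100.0], [110.0], [120.0]])
test_prices = np.array([[200.0], [210.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(train_prices)              # fit on the training split only
print(scaler.transform(test_prices))  # [[9.], [10.]], far outside [-1, 1]

The model only ever saw targets inside [-1, 1] during training, so it cannot produce predictions in the range the test set actually occupies.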
So it either needs adaptive normalisation, or the model target (y_value) has to be changed to a percentage delta change or a trend classification; both options are sketched after this paragraph.
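A minimal sketch of both options, assuming dataset is a pandas DataFrame with a Close column; the window length and new column names are illustrative assumptions, not taken from the repo:

import pandas as pd

# Option 1: adaptive normalisation, a rolling z-score that uses only
# past and current observations, so no future information leaks in
window = 60  # illustrative lookback length
rolling = dataset['Close'].rolling(window)
dataset['Close_norm'] = (dataset['Close'] - rolling.mean()) / rolling.std()

# Option 2a: percentage delta target, i.e. predict the next-day return
dataset['Return'] = dataset['Close'].pct_change().shift(-1)

# Option 2b: trend-classification target, 1 if the next close is higher
dataset['Trend'] = (dataset['Close'].shift(-1) > dataset['Close']).astype(int)

dataset = dataset.dropna()  # drop rows lacking a full window or a next-day label

Unlike raw prices, returns and trend labels stay in a comparable range across the train/test boundary, which is why such a target still works with a scaler fitted on training data only.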