Data leakage when normalizing the train data and test data together?

hungchun-lin / Stock-price-prediction-using-GAN

In this project, we will compare two algorithms for stock prediction. First, we will use a Long Short-Term Memory (LSTM) network for stock market prediction. LSTM is a powerful method that can learn order dependence in sequence prediction problems. Second, we will use a Generative Adversarial Network (GAN) to make the prediction, with an LSTM as the generator and a CNN as the discriminator. In addition, Natural Language Processing (NLP) will be used to analyze the influence of news on stock prices.

MIT License

225 stars, 102 forks

dataset = pd.read_csv('Finaldata_with_Fourier.csv', parse_dates=['Date'])
...
y_value = pd.DataFrame(dataset.iloc[:, 3])
y_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler.fit(y_value)
y_scale_dataset = y_scaler.fit_transform(y_value)
X, y, yc = get_X_y(X_scale_dataset, y_scale_dataset)
y_train, y_test, = split_train_test(y)
yc_train, yc_test, = split_train_test(yc)

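The data leakage in this project is serious: the scaler is fit on the whole dataset before the train/test split, so the minimum and maximum of the test period leak into training. I wonder how the academic peer reviewers accepted the paper...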

train_size = round(len(dataset) * 0.7)
print(f'Training Data Size: {train_size}')
train_data = dataset[0:train_size]
test_data = dataset[train_size:]

X_train = pd.DataFrame(train_data)
X_test = pd.DataFrame(test_data)
y_train = pd.DataFrame(train_data['Close'])
y_test = pd.DataFrame(test_data['Close'])

# Normalize the data: fit the scalers on the training split only,
# then apply the fitted scalers to the test split
X_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train)
X_test = X_scaler.transform(X_test)
y_test = y_scaler.transform(y_test)

This snippet generates the normalised data without data leakage. After this correction, however, the scaler no longer works properly, which makes the model useless.

The prices in the test period are far higher than those in the training period, so the test values fall outside the range the scaler was fit on. Common methods like z-score or MinMax scaling will not be useful here.
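To make the out-of-range problem concrete, here is a minimal sketch. It uses a hypothetical, steadily rising price series (not the project's data) and a small plain-Python helper, `fit_min_max`, that stands in for `MinMaxScaler(feature_range=(-1, 1))` fit on the training split only.

```python
def fit_min_max(values, feature_range=(-1.0, 1.0)):
    """Fit min-max scaling on `values` only and return a transform function."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    span = hi - lo

    def transform(xs):
        # Same formula for any input; lo/hi always come from the fitted data.
        return [a + (x - lo) * (b - a) / span for x in xs]

    return transform


# Hypothetical closing prices with a steady upward trend (illustration only).
prices = [100.0 + 2.0 * i for i in range(100)]

# Chronological 70/30 split, as in the snippet above.
train_size = round(len(prices) * 0.7)
train, test = prices[:train_size], prices[train_size:]

scale = fit_min_max(train)   # fit on the training split only
train_scaled = scale(train)
test_scaled = scale(test)

print("train range:", min(train_scaled), max(train_scaled))
print("test range: ", min(test_scaled), max(test_scaled))
```

Every scaled test value lands above 1, outside the range the model saw during training, which is consistent with the model breaking once the leak is removed.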

So the project needs either adaptive normalisation or a different model target (y_value), such as the percentage change in price or a trend class.
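The last two options can be sketched roughly as follows. The price series, the repeating return pattern, and the helper names (`pct_change`, `trend_labels`, `to_price`) are all assumptions made for illustration; none of them come from the repository.

```python
def pct_change(prices):
    """Fractional change between consecutive prices: (p[t] - p[t-1]) / p[t-1]."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def trend_labels(returns):
    """Binary trend class: 1 if the price rose, 0 otherwise."""
    return [1 if r > 0 else 0 for r in returns]


def to_price(last_close, predicted_return):
    """Turn a predicted fractional change back into a price level."""
    return last_close * (1.0 + predicted_return)


# Hypothetical prices built from a repeating pattern of daily returns,
# so the price level drifts upward while the returns stay in a fixed band.
pattern = [0.02, -0.01, 0.03]
prices = [100.0]
for i in range(99):
    prices.append(prices[-1] * (1.0 + pattern[i % 3]))

returns = pct_change(prices)
labels = trend_labels(returns)

# Chronological 70/30 split on the new target.
train_size = round(len(returns) * 0.7)
train_r, test_r = returns[:train_size], returns[train_size:]

print("max train price:", max(prices[: train_size + 1]))
print("max test price: ", max(prices[train_size + 1 :]))
print("train return range:", min(train_r), max(train_r))
print("test return range: ", min(test_r), max(test_r))
```

In this toy series the test prices are well above the training prices, yet the returns in both splits cover the same band, so a scaler fit on the training returns still applies to the test period. Real returns are not this regular, so this only illustrates why the change of target helps; it does not guarantee it.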


Data leakage when normalizing the train data and test data together? #14