Shuffling Data - GitHub Issues

nayash / stock_cnn_blog_pub

This project is a loose implementation of the paper "Algorithmic Financial Trading with Deep Convolutional Neural Networks: Time Series to Image Conversion Approach"

Apache License 2.0

164 stars, 95 forks

Shuffling Data #1

Closed by personal-coding 4 years ago

personal-coding commented 4 years ago

When you're doing time series analysis with technical indicators, you can't shuffle your data. Real time series data is not shuffled. Also, shuffling the data leaks historical data, because the technical indicators are calculated from historical data. The first warning sign should have been the very high accuracy (i.e. if you could predict the stock market with 85%+ accuracy, you'd be a rich man).

This is very similar to the issue with this research: https://www.reddit.com/r/algotrading/comments/cv83yh/overfitting

nayash commented 4 years ago

I checked the code: I retrieve the training data and the test data sequentially from the loaded CSV (historical data). Then I shuffle the training data and split it into training and validation sets. Shuffling is an issue if you leak any future data into your training data, which AFAIK I haven't done (or at least never intended to, unless there is a bug). Can you point out the relevant code section if you have found a bug?
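For readers following along, the split described in this comment can be sketched roughly as below. This is a minimal illustration, not the repository's actual code: the helper names `chronological_split` and `shuffled_train_val_split`, the fractions, and the integer stand-in rows are all assumptions made for the example.

```python
import random

def chronological_split(rows, test_fraction=0.2):
    """Hypothetical helper: keep time order, so every test row is later than every training row."""
    cut = len(rows) - int(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]

def shuffled_train_val_split(train_rows, val_fraction=0.25, seed=0):
    """Hypothetical helper: shuffle only the training portion, then carve out a validation set."""
    shuffled = list(train_rows)          # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Stand-in data: each integer plays the role of one time-ordered sample.
rows = list(range(100))
train_full, test = chronological_split(rows)
train, val = shuffled_train_val_split(train_full)

# No test-period row ends up in training or validation.
print(max(train + val) < min(test))
```

In this sketch the test set stays strictly in the future relative to training and validation, while the training and validation rows are interleaved in time.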

personal-coding commented 4 years ago

You are inherently leaking data because technical indicators are a function of historical data. Try running your training and prediction with every shuffle=True changed to shuffle=False.
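A rough sketch of what the suggested experiment changes, assuming a hypothetical `train_val_split` helper with a `shuffle` flag (the repository's real call sites and their other arguments are not shown in this thread):

```python
import random

def train_val_split(rows, val_fraction=0.25, shuffle=True, seed=0):
    """Hypothetical helper: hold out part of the time-ordered training rows for validation."""
    rows = list(rows)
    n_val = int(len(rows) * val_fraction)
    if shuffle:
        random.Random(seed).shuffle(rows)
    # With shuffle=False the validation set is the most recent contiguous block.
    return rows[:len(rows) - n_val], rows[len(rows) - n_val:]

# Stand-in data: each integer plays the role of one time-ordered sample.
rows = list(range(80))
train_s, val_s = train_val_split(rows, shuffle=True)
train_u, val_u = train_val_split(rows, shuffle=False)

print(sorted(val_s)[:5])  # validation rows scattered through the training period
print(val_u[:5])          # validation rows all come after the training rows
```

With `shuffle=False`, validation accuracy is measured only on rows that come after every training row, which is the comparison being proposed here.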

nayash commented 4 years ago

Yes. Technical indicators are a function of historical data; that's why people use technical indicators as features to learn from various patterns. "Leaking" is when you use future data in your training data, which would be cheating. Looking at past data is not an issue.
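To make "a function of historical data" concrete, here is a minimal sketch using a trailing simple moving average as a stand-in indicator. The function `trailing_sma`, the window size, and the prices are illustrative assumptions, not the indicators the project actually computes.

```python
def trailing_sma(prices, window):
    """Simple moving average over the `window` most recent prices ending at each index."""
    out = []
    for t in range(len(prices)):
        if t + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(prices[t - window + 1 : t + 1]) / window)
    return out

prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
sma = trailing_sma(prices, window=3)

# Change only the last two ("future") prices and recompute.
altered = prices[:4] + [100.0, 200.0]
sma_altered = trailing_sma(altered, window=3)

# Values up to index 3 are unaffected: they depend only on past and current prices.
print(sma[:4] == sma_altered[:4])
```

The value at index t never looks ahead of t. The windows of neighbouring rows do overlap, though (indices 2 and 3 both use the prices at indices 1 and 2), which is why the two sides of this thread disagree about whether shuffled validation rows are independent of the training rows.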