Closed (personal-coding closed 4 years ago)
I checked the code: I retrieve the training data and test data sequentially from the loaded CSV (historical data), then shuffle and split the training data into training and validation sets. Shuffling is only an issue if you leak future data into your training data, which AFAIK I haven't done (or at least never intended, unless there is a bug). Can you point out the relevant code section if you have found a bug?
You are inherently leaking data, because technical indicators are a function of historical data. Try running your training and prediction with shuffle=False everywhere.
(Updated to shuffle=False.)
Yes, technical indicators are a function of historical data; that's why people use them as features to learn from various patterns. "Leaking" is when you use future data in your training data, which would be cheating. Looking at past data is not an issue.
When you're doing time series analysis with technical indicators, you can't shuffle your data. Real time series data is not shuffled. Shuffling also leaks information between the training and validation sets: each technical indicator is calculated over a window of historical data, so rows that are close in time share most of their inputs, and a shuffled split puts training rows right next to (and after) validation rows. Your first warning sign should have been the very high accuracy (i.e. if you could really predict the stock market with 85%+ accuracy, you'd be a rich man).
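To make the point concrete, here is a minimal sketch in plain Python, not taken from the repository. The function names, the 100-row dataset, the 20% validation fraction, and the 5-row indicator window are all assumptions chosen for illustration. It counts how many validation rows have a training row that comes later in time yet close enough that the training row's indicator window still covers the validation row.

```python
import random

def chronological_split(n_rows, val_fraction=0.2):
    """Train on the earliest rows, validate on the most recent ones."""
    n_val = round(n_rows * val_fraction)
    cut = n_rows - n_val
    return list(range(cut)), list(range(cut, n_rows))

def shuffled_split(n_rows, val_fraction=0.2, seed=0):
    """Random split that ignores time order (what shuffle=True does)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_val = round(n_rows * val_fraction)
    cut = n_rows - n_val
    return sorted(idx[:cut]), sorted(idx[cut:])

def future_overlap(train_idx, val_idx, window):
    """Count validation rows v for which some training row t satisfies
    v < t < v + window, i.e. a *later* training row whose indicator
    window (rows t-window+1 .. t) still includes row v."""
    train = set(train_idx)
    return sum(
        1
        for v in val_idx
        if any((v + k) in train for k in range(1, window))
    )

if __name__ == "__main__":
    n_rows, window = 100, 5  # hypothetical sizes, for illustration only
    for name, split in [("chronological", chronological_split),
                        ("shuffled", shuffled_split)]:
        train_idx, val_idx = split(n_rows)
        leaked = future_overlap(train_idx, val_idx, window)
        print(f"{name}: {leaked} of {len(val_idx)} validation rows "
              f"are covered by a later training row's window")
```

With a chronological split, no training row comes after any validation row, so the count is zero by construction. If the project uses scikit-learn, `train_test_split(..., shuffle=False)` or `TimeSeriesSplit` give a time-ordered split without hand-rolled index code.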
This is very similar to the issue with this research: https://www.reddit.com/r/algotrading/comments/cv83yh/overfitting