nateemma / strategies

Custom trading strategies using the freqtrade framework

data contamination with Shuffle=True in train_test_split #8

Open ThomasGoud opened 1 year ago

ThomasGoud commented 1 year ago

Hello,

Thank you for your incredible and very interesting repository! While analysing your code, especially binanceus/PCA.py, I was wondering why the train/test data for model training is shuffled:

df_train, df_test, res_train, res_test = train_test_split(df, labels, train_size=0.8, random_state=27, shuffle=True)

Isn't there a risk of contaminating the model with future data (looking into the future)?

Thanks, Thomas

nateemma commented 1 year ago

Hi Thomas,

That is a very good question. The code is indeed looking into the future, but in this case it is intentional. I use the historical data to identify buy and sell signals (looking ahead), then train the detection algorithm on that data and the signals. The actual prediction mechanism (on the most recent samples) does not look at future data, however, so it should be OK (though backtesting might be optimistic). If you look in DataframePopulator, you will see that anything that potentially looks into the future is in a separate function (add_future_data). Then, in the main class (PCA), you should see that any such future data is held in a separate dataframe (usually called future_df), which is not visible to the prediction code.
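A minimal sketch of the separation described above, assuming a dataframe with a `close` column. The function name `make_future_frame` and the `lookahead` and `threshold` parameters are hypothetical and are not taken from the repository; the point is only that look-ahead columns live in their own frame and never get added to the frame the predictor sees.

```python
import pandas as pd


def make_future_frame(df: pd.DataFrame, lookahead: int = 3, threshold: float = 0.01) -> pd.DataFrame:
    """Build look-ahead labels in a separate frame; `df` itself is not modified."""
    future_df = pd.DataFrame(index=df.index)
    # shift(-lookahead) pulls the close from `lookahead` candles in the future.
    future_close = df["close"].shift(-lookahead)
    future_df["future_gain"] = (future_close - df["close"]) / df["close"]
    # The last `lookahead` rows have no future data (NaN), so they get label 0.
    future_df["buy_label"] = (future_df["future_gain"] > threshold).astype(int)
    return future_df


df = pd.DataFrame({"close": [100.0, 101.0, 103.0, 102.0, 110.0, 111.0]})
future_df = make_future_frame(df, lookahead=2, threshold=0.02)

# Labels are used for training only; the feature frame still has just "close".
print(future_df["buy_label"].tolist())
print(list(df.columns))
```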

In this particular case, I enabled shuffle because the data points are 'biased' to include sufficient buy/sell signals to train the detection algorithms and so are not a true timeseries, and I turned on shuffle to get a (somewhat) random selection of those points. If you look at other families of algorithms (Anomaly, NNBC and NNPredict), you will see that I do not do this because I need the data to be real timeseries (i.e. in order, no missing data). Accordingly, the big problem with these other algorithms is that buy/sell signals only constitute about 1% of the data, so many algorithms will not train well. To get around that, I typically train the algorithms over very long periods of time and then save the weights for use in real-time.
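To make the difference concrete, here is a small sketch on synthetic data (not the repository's data) comparing the shuffled call from the question with a chronological split. With `shuffle=False`, scikit-learn's `train_test_split` takes the first 80% of rows for training and the rest for testing, so no training row is later than any test row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a time-ordered table: the value in row i is its candle index.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Shuffled split, same arguments as the call quoted in the issue.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, train_size=0.8, random_state=27, shuffle=True
)

# Chronological split: order is preserved.
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, train_size=0.8, shuffle=False
)

# True when some training rows come later in time than some test rows.
shuffled_mixes_time = bool(X_train_s.max() > X_test_s.min())
# True when every training row precedes every test row.
chrono_is_ordered = bool(X_train_c.max() < X_test_c.min())

print("shuffled split mixes past and future:", shuffled_mixes_time)
print("chronological split keeps order:", chrono_is_ordered)
```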

Hope that helps,

Cheers,

Phil

ThomasGoud commented 1 year ago

Hello Phil,

Thanks for your reply. After inspecting your code and rereading your explanation, it is now clear how it works.

The only drawback I see, if I understood correctly, is that the backtesting will occur on points on which the classifier has been trained.

Do you think this code can be run in trading mode (dry_run)? It seems that the populate_indicators() method will be called every 5 seconds, so the PCA / classifier training will be repeated over and over. Would you advise adding a time check to restart the PCA / CLF training every hour, for example? Finally, wouldn't it help avoid overfitting to have a single PCA/CLF across all coins instead of one per coin?

Thanks in advance,

Thomas

nateemma commented 1 year ago

Again, those are very good questions. Some answers below:

> The only drawback I see, if I understood correctly, is that the backtesting will occur on points on which the classifier has been trained.

Yes, but that's the only data I have, so I can't avoid it. For the PCA algorithms, I take a random sampling of available data for fitting, so it doesn't really match the data that would be seen when assessing buy/sell criteria (but it is not the best situation). This is actually why I started looking at alternate types of strategy such as neural networks and anomaly detection.

For the neural network strategies, training doesn't happen on all of the data; some of it is reserved for testing. You can also train on older data and test on newer data that the training run didn't see (I started doing this recently). Also, neural networks don't really 'remember' the data, they just set the model parameters to values that produce the best results (lowest loss) when comparing predicted values to actual values (the buy/sell signals created by looking ahead). Additionally, see the answer to your last question.

> Do you think this code can be run in trading mode (dry_run)? It seems that the populate_indicators() method will be called every 5 seconds, so the PCA / classifier training will be repeated over and over.
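One way to sketch "train on older data, test on newer data" is scikit-learn's `TimeSeriesSplit`, where each fold's test rows always come after its training rows. This is an illustration on synthetic indices only and is not how the repository's neural network strategies are actually evaluated.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for 8 time-ordered candles.
X = np.arange(8).reshape(-1, 1)

folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Each fold trains on an expanding window of older rows and tests on newer ones.
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```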

Training for neural network models only happens in backtest mode. For the PCA strategies, I cannot avoid this because the underlying classifiers do not support cumulative fitting, i.e. I have to refit every time otherwise I wouldn't have a valid model. However, these algorithms are quite fast, and I have not had any issues running in trading modes.

> Would you advise adding a time check to restart the PCA / CLF training every hour, for example?

I have not seen a need to do this. However, there is already some randomness built in, where refitting does not happen on every call (a random number is generated after fitting, and re-fitting does not occur again until that many calls have happened).
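The randomized refit gap described above could look roughly like this. `RefitScheduler`, `max_skip` and `seed` are hypothetical names for illustration; the repository's own counter may be implemented differently.

```python
import random


class RefitScheduler:
    """Refit on the first call, then wait a random number of calls before refitting again."""

    def __init__(self, max_skip: int = 10, seed=None):
        self._rng = random.Random(seed)
        self._max_skip = max_skip
        self._calls_until_refit = 0  # 0 means "refit on the next call"

    def should_refit(self) -> bool:
        if self._calls_until_refit > 0:
            # Still inside the random gap: skip fitting on this call.
            self._calls_until_refit -= 1
            return False
        # Refit now and draw a new gap of 1..max_skip calls to skip.
        self._calls_until_refit = self._rng.randint(1, self._max_skip)
        return True


scheduler = RefitScheduler(max_skip=5, seed=0)
decisions = [scheduler.should_refit() for _ in range(50)]
print("refits in 50 calls:", sum(decisions))
```

In a strategy, a check like `scheduler.should_refit()` would guard the fitting step inside the per-candle callback, so that most calls reuse the previously fitted model.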

> Finally, wouldn't it help avoid overfitting to have a single PCA/CLF across all coins instead of one per coin?

Unfortunately, this is not possible for the PCA (or most of the Anomaly Detection) algorithms, because the underlying classifiers cannot be cumulatively fitted - so I cannot combine data from one pair with data from another.

For the neural network strategies (NNBC*, NNPredict*), this option does exist (just set model_per_pair to False, which is the default).
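As a rough illustration of the cumulative-fitting limitation mentioned above: in scikit-learn, estimators that can be updated batch by batch expose a `partial_fit` method, and many do not. The estimators below are examples chosen for the sketch and are not necessarily the ones used in the repository; `supports_incremental_fit` is a hypothetical helper.

```python
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier


def supports_incremental_fit(estimator) -> bool:
    """True if the estimator can be updated with new batches via partial_fit."""
    return hasattr(estimator, "partial_fit")


candidates = {
    "PCA": PCA(),
    "IncrementalPCA": IncrementalPCA(),
    "RandomForestClassifier": RandomForestClassifier(),
    "SGDClassifier": SGDClassifier(),
}
support = {name: supports_incremental_fit(est) for name, est in candidates.items()}

for name, ok in support.items():
    print(f"{name}: partial_fit available = {ok}")
```

An estimator without `partial_fit` has to be refitted from scratch on whatever data it is given, which matches the refit-every-time behaviour described for the PCA strategies.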

Hope that helps

Thanks,

Phil
