sadighian / crypto-rl

Deep Reinforcement Learning toolkit: record and replay cryptocurrency limit order book data & train a DDQN agent

Question about exporting training data #20

Closed · epyx25 closed this issue 4 years ago

epyx25 commented 4 years ago

In your paper you write:

> We export the recorded data to compressed CSV files, segregated by trading date using UTC timezone; each file is approximately 160Mb in size before compression.

Now I want to create the features for day 4 (00:00:00 - 23:59:59). When the Simulator queries for that day's data, _query_arctic searches for the first load_book point https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/data_recorder/database/database.py#L90, but it won't find one within that window, so it returns zero ticks.
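
To illustrate, here is a rough sketch of the kind of day-bounded Arctic TickStore query I mean; the Mongo host, library name, symbol, and the 'type' column are placeholders, not the repo's exact schema:

```python
# Rough sketch, not the repo's code: read a single UTC day from Arctic TickStore
# and check whether the slice contains a load_book snapshot message.
import pandas as pd
from arctic import Arctic
from arctic.date import DateRange
from arctic.exceptions import NoDataFoundException

store = Arctic('localhost')                      # placeholder MongoDB host
library = store['coinbase_tick_store']           # placeholder library name

start = pd.Timestamp('2020-01-04 00:00:00', tz='UTC')
end = pd.Timestamp('2020-01-04 23:59:59', tz='UTC')

try:
    ticks = library.read('BTC-USD', date_range=DateRange(start, end))
    has_snapshot = (ticks['type'] == 'load_book').any()
except NoDataFoundException:
    ticks, has_snapshot = pd.DataFrame(), False

print(f'{len(ticks)} ticks, load_book present: {has_snapshot}')
# If the snapshot was recorded on an earlier day, this day-4-only window
# contains no load_book message, so the replay has nothing to seed the book with.
```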

I saw that the Simulator's extract_features method splits the data by days https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/data_recorder/database/simulator.py#L303, so I could extract from the first day through the 5th and then take the 4th separately, but if I try to read that much data from the database the process gets killed because of OOM (10 GB).

How did you extract the 30 days of training data? In one batch? How much memory is recommended for that? To be honest, I have not looked through the whole code yet, so maybe I am missing something.

sadighian commented 4 years ago

Hi @epyx25 ,

I replay historical data (e.g., 30+ days) in multiple mini-batches of 1-4 days, depending on the instrument. The batch sizes depend on several factors: hardware, how active the historical trading days were, and how many load_book calls were made.
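
If it helps, here is a schematic of that batching loop. The extract_features call and the query keys below are illustrative only; check data_recorder/database/simulator.py for the actual signature.

```python
# Schematic mini-batch export loop: replay a few days at a time so memory is
# released between batches. Query keys and the extract_features signature are
# assumptions for illustration, not the repo's exact API.
import gc
from datetime import date, timedelta

from data_recorder.database.simulator import Simulator

BATCH_DAYS = 2          # 1-4 days per batch, depending on hardware and activity


def export_in_batches(symbol: str, first_day: date, last_day: date) -> None:
    sim = Simulator()
    day = first_day
    while day <= last_day:
        batch_end = min(day + timedelta(days=BATCH_DAYS - 1), last_day)
        query = {'ccy': [symbol], 'start_date': day, 'end_date': batch_end}
        sim.extract_features(query)   # exports one compressed CSV per trading day
        gc.collect()                  # free the replayed ticks before the next batch
        day = batch_end + timedelta(days=1)


# export_in_batches('BTC-USD', date(2020, 1, 1), date(2020, 1, 30))
```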

As a reminder, the approach to storing data with Arctic TickStore is to use a list of flat dictionaries to take advantage of 10x compression. This means that nested data formats received via WebSocket / REST requests need to be transformed. For example, Limit Order Book (LOB) snapshots are split into individual messages with start and end points flagged (e.g., load_book and book_loaded).
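
As a rough illustration of that flattening step (field names other than the load_book / book_loaded flags are invented for this example):

```python
# Rough illustration only: flatten a nested LOB snapshot into the list-of-flat-
# dicts layout that TickStore compresses well. Field names besides the
# load_book / book_loaded flags are invented for this example.
def flatten_snapshot(snapshot: dict, symbol: str, timestamp) -> list:
    """Turn {'bids': [[price, size], ...], 'asks': [...]} into flat tick dicts."""
    ticks = [{'index': timestamp, 'product_id': symbol, 'type': 'load_book'}]
    for side in ('bids', 'asks'):
        for price, size in snapshot[side]:
            ticks.append({'index': timestamp, 'product_id': symbol, 'type': 'book',
                          'side': side, 'price': float(price), 'size': float(size)})
    ticks.append({'index': timestamp, 'product_id': symbol, 'type': 'book_loaded'})
    return ticks
```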

That said, your understanding of the process flow is correct:

  1. Recording data
  2. Replaying the recorded data to export it to CSV

Hope that answers your questions.

martin-sn commented 3 years ago

Hi, thank you very much for the repo. I am dealing with a similar issue when exporting the data. I have 32 GB of RAM, but I am still not able to export a full day without running out of memory. What changes could I make so that this does not happen?

Thanks in advance.