Closed · epyx25 closed this issue 4 years ago
Hi @epyx25,
I replay historical data (e.g., 30+ days) in multiple mini-batches of 1-4 days, depending on the instrument. The batch sizes are based on many factors: hardware, how active the historical trading days were, and how many `load_book` calls were made.
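Purely as an illustration (this is a minimal sketch, not code from this repo; `date_windows` and `batch_days` are hypothetical names), the mini-batch replay amounts to iterating over short date windows instead of pulling the full history at once:

```python
from datetime import datetime, timedelta

def date_windows(start: datetime, end: datetime, batch_days: int = 2):
    """Yield (window_start, window_end) pairs covering [start, end)
    in mini-batches of `batch_days` days, so each replay stays small
    enough to fit in memory."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=batch_days), end)
        yield cursor, window_end
        cursor = window_end

# Replay 30+ days of history two days at a time.
for start, end in date_windows(datetime(2020, 1, 1), datetime(2020, 2, 1)):
    # Query the tick store for just this window, export features to CSV,
    # then drop the data before moving on to the next window.
    pass
```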
As a reminder, the approach to storing data with Arctic TickStore is to use a list of flat dictionaries to take advantage of 10x compression. This means that nested data formats received via WebSocket / REST requests need to be transformed. For example, Limit Order Book (LOB) snapshots are split into individual messages with start and end points flagged (e.g., `load_book` and `book_loaded`).
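For instance (a minimal sketch with illustrative field names, not the recorder's actual code), a nested level-2 snapshot could be flattened into a list of flat dictionaries book-ended by the two flags:

```python
def flatten_snapshot(snapshot: dict, product_id: str, timestamp: float) -> list:
    """Turn a nested LOB snapshot into flat dicts suitable for a tick store,
    book-ended by load_book / book_loaded flag messages."""
    messages = [{'type': 'load_book', 'product_id': product_id, 'time': timestamp}]
    for side in ('bids', 'asks'):
        for price, size in snapshot.get(side, []):
            messages.append({
                'type': side,            # which side of the book
                'price': float(price),
                'size': float(size),
                'product_id': product_id,
                'time': timestamp,
            })
    messages.append({'type': 'book_loaded', 'product_id': product_id, 'time': timestamp})
    return messages
```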
That said, your understanding of the process flow is correct:
1. The WebSocket subscribes to a level 2 or 3 LOB feed.
2. A `load_book` flag is inserted into the database to mark the start of a new LOB snapshot.
3. `simulator.extract_features()` is used to export data in mini-batches (one CSV per trading day).
4. `database._query_arctic()` filters the query result, returning data after the first `load_book` flag; all data before the flag is disregarded, since those messages are partial updates and an initial LOB snapshot is required as the starting point (see the sketch below).
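Conceptually, that filtering step looks something like the following (a hedged sketch assuming the query result is a pandas DataFrame with a `type` column; the actual implementation in `database.py` may differ):

```python
import pandas as pd

def trim_to_first_snapshot(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every tick before the first load_book flag; partial updates
    are unusable without an initial LOB snapshot as the starting point."""
    is_load_book = (df['type'] == 'load_book').to_numpy()
    if not is_load_book.any():
        # No snapshot inside the queried window -> nothing usable to return.
        return df.iloc[0:0]
    return df.iloc[is_load_book.argmax():]
```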
Hope that answers your questions.
Hi, thank you very much for the repo. I am dealing with a similar issue when exporting the data. I have 32 GB of RAM, but I am still not able to export a full day without running out of memory. What changes could I make so that this does not happen?
Thanks in advance.
In your paper you write:
Let's say I start to record data on day 1 at 18:00:00 UTC. Since this is the first tick, the recorder requests a new order book snapshot from the exchange and creates a tick with type `load_book`.
The recording runs for 110 hours, until day 6 at 08:00:00, without any issue.
Now I want to create the features for day 4 (00:00:00 - 23:59:59). When the Simulator queries for the data, `_query_arctic` searches for the first `load_book` point (https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/data_recorder/database/database.py#L90), but it won't find anything, so it returns zero ticks.
I saw that the Simulator's `extract_features` method splits the data by days (https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/data_recorder/database/simulator.py#L303). I could extract from the first day through the 5th, and then I would have the 4th separated, but if I try to read that much data from the DB, the process gets killed because of OOM (10 GB).
How did you extract the 30 days of training data? In one batch? What is the recommended amount of memory for that? To be honest, I have not looked through the whole code yet, so maybe I am missing something.