Closed · epyx25 closed this issue 4 years ago
Hi @epyx25,
I replay historical data (e.g., 30+ days) in multiple mini-batches of 1-4 days, depending on the instrument. The batch sizes are based on many factors: hardware, how active the historical trading days were, and how many `load_book` calls were made.
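Purely as an illustration (this is a minimal sketch, not code from this repo; `date_windows` and `batch_days` are hypothetical names), the mini-batch replay amounts to iterating over short date windows instead of pulling the full history at once:

```python
from datetime import datetime, timedelta

def date_windows(start: datetime, end: datetime, batch_days: int = 2):
    """Yield (window_start, window_end) pairs covering [start, end)
    in mini-batches of `batch_days` days, so each replay stays small
    enough to fit in memory."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=batch_days), end)
        yield cursor, window_end
        cursor = window_end

# Replay 30+ days of history two days at a time.
for start, end in date_windows(datetime(2020, 1, 1), datetime(2020, 2, 1)):
    # Query the tick store for just this window, export features to CSV,
    # then drop the data before moving on to the next window.
    pass
```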
As a reminder, the approach to storing data with Arctic TickStore is to use a list of flat dictionaries to take advantage of 10x compression. This means that nested data formats received via WebSocket / REST requests need to be transformed. For example, Limit Order Book (LOB) snapshots are split into individual messages with start and end points flagged (e.g., `load_book` and `book_loaded`).
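For instance (a minimal sketch with illustrative field names, not the recorder's actual code), a nested level-2 snapshot could be flattened into a list of flat dictionaries book-ended by the two flags:

```python
def flatten_snapshot(snapshot: dict, product_id: str, timestamp: float) -> list:
    """Turn a nested LOB snapshot into flat dicts suitable for a tick store,
    book-ended by load_book / book_loaded flag messages."""
    messages = [{'type': 'load_book', 'product_id': product_id, 'time': timestamp}]
    for side in ('bids', 'asks'):
        for price, size in snapshot.get(side, []):
            messages.append({
                'type': side,            # which side of the book
                'price': float(price),
                'size': float(size),
                'product_id': product_id,
                'time': timestamp,
            })
    messages.append({'type': 'book_loaded', 'product_id': product_id, 'time': timestamp})
    return messages
```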
That said, your understanding of the process flow is correct:
1. The WebSocket subscribes to a level 2 or 3 LOB feed.
2. A `load_book` flag is inserted into the database to mark the start of a new LOB snapshot.
3. `simulator.extract_features()` is used to export data in mini-batches (one CSV per trading day).
4. `database._query_arctic()` filters the query result, returning data after the first `load_book` flag; all data before the flag is disregarded, since those messages are partial updates and an initial LOB snapshot is required as the starting point (see the sketch below).
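Conceptually, that filtering step looks something like the following (a hedged sketch assuming the query result is a pandas DataFrame with a `type` column; the actual implementation in `database.py` may differ):

```python
import pandas as pd

def trim_to_first_snapshot(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every tick before the first load_book flag; partial updates
    are unusable without an initial LOB snapshot as the starting point."""
    is_load_book = (df['type'] == 'load_book').to_numpy()
    if not is_load_book.any():
        # No snapshot inside the queried window -> nothing usable to return.
        return df.iloc[0:0]
    return df.iloc[is_load_book.argmax():]
```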
Hope that answers your questions.
Hi, thank you very much for the repo. I am dealing with a similar issue when exporting the data. I have 32 GB of RAM, but I am still not able to export a full day without running out of memory. What changes could I make so that this does not happen?
Thanks in advance.
In your paper you write:
Let's say I start to record data on day 1 at 18:00:00 UTC. Since this is the first tick, the recorder requests a new order book snapshot from the exchange and creates a tick with type `load_book`.
The recording runs for 110 hours, until day 6 at 08:00:00, without any issue.
Now I want to create the features for day 4 (00:00:00 - 23:59:59). When the Simulator queries for the data, `_query_arctic` searches for the first `load_book` point (https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/data_recorder/database/database.py#L90), but it won't find anything, so it returns zero ticks.
I saw that the Simulator's `extract_features` method splits the data by days (https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/data_recorder/database/simulator.py#L303). I could extract from the first day through the 5th, and then I would have the 4th separated, but if I try to read that much data from the DB, the process gets killed because of OOM (10 GB).
How did you extract the 30 days of training data? In one batch? What is the recommended amount of memory for that? To be honest, I have not looked through the whole code yet, so maybe I am missing something.