sadighian / crypto-rl

Deep Reinforcement Learning toolkit: record and replay cryptocurrency limit order book data & train a DDQN agent

Out of memory and "load_book" not being inserted #22

Open martin-sn opened 3 years ago

martin-sn commented 3 years ago

Hi, once again, thank you very much for the repo.

I have been having an issue where I am unable to export a day of BTC-USD data: the process gets killed because I run out of memory, even with 32 GB of RAM. I commented on https://github.com/sadighian/crypto-rl/issues/20, but I am now opening a ticket as I have run into more issues.

I have been trying to modify the code so that I can query and export less than a day's worth of data at a time (to get around the memory problem), and it has been somewhat successful. However, I have an issue with the _query_artic function. If I understand the project correctly, a "load_book" marker should be inserted into the database after every snapshot (e.g. every second with the default settings). However, when I try to query less than a day's worth at a time, no "load_book" is found, and so I get an exception that my query contains no data (the start index does not exist because cursor.loc does not find a "load_book").

So I have been querying the data directly with Arctic commands, and in my data it seems that "load_book" is only inserted once, when I start recorder.py, and then never again.
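For reference, this is roughly how I have been checking the data (a minimal sketch; the library name, symbol, and the 'type' column I use to spot the "load_book" rows are my guesses at the schema, not necessarily what recorder.py actually uses):

```python
from datetime import datetime

import pytz
from arctic import Arctic
from arctic.date import DateRange

# Connect to the local MongoDB instance that recorder.py writes to.
store = Arctic('localhost')
print(store.list_libraries())            # find the right tick library name
library = store['coinbase.tickstore']    # <-- guessed name, replace with yours

# Read a two-hour slice instead of a full day to keep memory usage down.
window = DateRange(datetime(2021, 1, 15, 0, tzinfo=pytz.UTC),
                   datetime(2021, 1, 15, 2, tzinfo=pytz.UTC))
df = library.read('BTC-USD', date_range=window)

# Count snapshot markers in the slice; when this is zero, the exporter has no
# starting point ("load_book") from which to reconstruct the order book.
print((df['type'] == 'load_book').sum())
```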

Am I misunderstanding how the project works, or is there an issue here?

I would be grateful for any help you can provide.

sadighian commented 3 years ago

Hi Martin,

Sorry to hear you're experiencing "out of memory" challenges when exporting recorded data to a CSV. From what you've described, there are no technical problems with the code base or setup on your machine. However, there are a few ways you can modify the code to prevent running out of memory when exporting data.

Your understanding of the data processing and capture is correct: the "load_book" flag is inserted every time a LOB snapshot is processed (i.e., when the WebSocket (re)connects or when out-of-sequence messages are received, either of which triggers a LOB snapshot reload).

Two possible solutions come to mind:

  1. Create a trigger that loads a new LOB snapshot more frequently. By increasing the number of "load_book" flags in your database, you'll have more "checkpoints" to use when reconstructing the LOB for the CSV export. The trigger could be implemented as a counter that fires every n messages or every n seconds (a minimal sketch follows this list).
  2. Insert LOB snapshots directly into the database instead of individual order messages. This approach would reduce your data footprint significantly (i.e., save 86,400 LOB snapshots per day, as opposed to 1-10MM individual order update messages), but it would require you to know the snapshot frequency ahead of time (i.e., once per second) and to (i) create a new MongoDB collection for the LOB snapshot data, (ii) extend the Database class to perform read/write operations for LOB snapshot data, and (iii) update the Recorder class to pass the LOB snapshots to the database.
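To make idea (1) a bit more concrete, here is a minimal sketch of such a trigger. The names below (PeriodicSnapshotTrigger, should_reload(), and the order_book.load_book() call in the comment) are illustrative only; you would wire the check into the recorder's message loop and reuse whatever method the order book already calls on a reconnect:

```python
class PeriodicSnapshotTrigger:
    """Counts streamed messages and signals when a fresh LOB snapshot is due."""

    def __init__(self, every_n_messages: int = 100_000):
        self.every_n_messages = every_n_messages
        self.message_count = 0

    def should_reload(self) -> bool:
        # Increment on every incoming message; reset and fire once the
        # threshold is reached.
        self.message_count += 1
        if self.message_count >= self.every_n_messages:
            self.message_count = 0
            return True
        return False


# Inside the recorder's message loop (pseudocode):
#
#   trigger = PeriodicSnapshotTrigger(every_n_messages=100_000)
#   ...
#   if trigger.should_reload():
#       order_book.load_book()   # same reload path used on reconnect / gap detection
```

A time-based variant (e.g. checking elapsed seconds instead of a message count) works the same way; the point is simply to write extra "load_book" checkpoints so that shorter export windows always contain at least one.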

Hope this answers your question!

martin-sn commented 3 years ago

Hi Sadighian,

Thank you very much for the reply.

Option 1 is the solution I am going for. But shouldn't the load_book flag already be inserted frequently, according to the specified snapshot rate?

I have been looking into creating the trigger. It should be placed here, https://github.com/sadighian/crypto-rl/blob/arctic-streaming-ticks-full/recorder.py#L69, right? But I can't quite figure out exactly which function I should call to insert the load_book flag.

Once again, thank you for the help.