man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

I need some guidance: reading/writing at maximum speed with TickStore #911

Open vargaspazdaniel opened 3 years ago

vargaspazdaniel commented 3 years ago

Arctic Version

'1.80.0'

Arctic Store

TickStore

Platform and version

Windows 10 x64, Intel I7-6700, 32GB RAM.

Description of problem and/or code sample that reproduces the issue

This is not really an issue, but rather a problem on my side. I have a Python script writing tick data from different assets in real time, and writing every tick works without problems... The problem is that I'm writing EVERY tick I receive this way, and my read speed is VERY LOW:

from arctic import Arctic
from arctic import TICK_STORE

store = Arctic("localhost")
store.initialize_library("Darwinex_Tick_Data", lib_type=TICK_STORE)

lib_tick_dwx = store["Darwinex_Tick_Data"]
lib_tick_dwx.write("EURUSD", tick_dataframe)  # called once per incoming tick

Notice that tick_dataframe is a dataframe with a single row (the tick), indexed by timestamp, so each write ends up as one MongoDB document. Writing the data this way works fine, but after reading some closed threads here I see that the efficient approach is to store at least 100k rows (ticks) per document.

Any advice on how to do that? Keep the incoming ticks in a dataframe and, once len > 100k, write them as one document, then accumulate the next 100k and write again? I'm still a very new user...

How can I merge all the single ticks into one document for faster read operations? Maybe that could be a solution for me.

Any other recommendations? Thanks in advance for reading, and thanks for this awesome library.

vargaspazdaniel commented 3 years ago

I took my whole tick database, loaded it entirely into a dataframe and then wrote it back to Mongo (now I have 100k rows per document, the default, instead of 1 tick per document as before). Read speed seems to have improved. I'll measure the difference and post it here.
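
Roughly, the re-chunking looks like this (just a sketch with my library and symbol names; back up the collection before deleting anything):

from arctic import Arctic

store = Arctic("localhost")
lib = store["Darwinex_Tick_Data"]

# read everything back into one dataframe (currently one tick per document)
df = lib.read("EURUSD")

# drop the old one-document-per-tick data...
lib.delete("EURUSD")

# ...and write it back in a single call, so TickStore can chunk it into
# ~100k-row documents (its default) instead of one document per row
lib.write("EURUSD", df)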

vargaspazdaniel commented 3 years ago

Reading speed test

Okay, grouping ticks into chunks of 100k rows (100k ticks = 1 document), I'm getting 166.57 seconds for 16,063,229 rows x 2 columns, while without grouping (1 tick = 1 document) the same amount of data takes 1108.18 seconds.

Definitely a huge difference... Now I need to figure out how to group 100k ticks before saving them in a document... Any idea? In my case ticks don't arrive fast enough to fill 100k rows quickly, so I don't know how to hold them while they accumulate, because if something goes wrong and the algo goes down I could lose up to 99,999 ticks that haven't been written to Mongo yet.
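
Something like this rough sketch is what I have in mind (the column names and 100k threshold are just illustrative; it still loses the in-memory batch on a hard crash, which is exactly my worry):

import atexit
import pandas as pd

FLUSH_THRESHOLD = 100_000   # ticks per write, matching TickStore's default chunk size
buffer = []                 # ticks accumulated in memory since the last flush

def on_tick(timestamp, bid, ask):
    # timestamps should be timezone-aware (e.g. UTC) for TickStore
    buffer.append({"index": timestamp, "bid": bid, "ask": ask})
    if len(buffer) >= FLUSH_THRESHOLD:
        flush()

def flush():
    global buffer
    if not buffer:
        return
    df = pd.DataFrame(buffer).set_index("index")
    lib_tick_dwx.write("EURUSD", df)   # one write = a few big documents
    buffer = []

atexit.register(flush)   # flush the partial batch on a clean shutdown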

PS: another important advantage is that the size of the DB has decreased from 1060 MB to 73 MB, insane...

dominiccooney commented 2 years ago

Now I need to figure out how to group 100k ticks before saving them in a document... Any idea?

Collect it in a different data store and then copy it to Arctic when you have accumulated enough rows. For example you could collect it in Redis with journaling.
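
A rough sketch of that idea with redis-py (the key name, JSON encoding and column names are arbitrary; lib_tick_dwx is the library handle from your first snippet). With Redis AOF persistence enabled, ticks that haven't reached Arctic yet survive a crash of the collector:

import json
import pandas as pd
import redis

r = redis.Redis()
JOURNAL_KEY = "ticks:EURUSD"   # arbitrary name for the Redis list used as a journal
THRESHOLD = 100_000

def on_tick(timestamp, bid, ask):
    # append the tick to the journal; with AOF enabled this write is durable
    tick = {"index": timestamp.isoformat(), "bid": bid, "ask": ask}
    r.rpush(JOURNAL_KEY, json.dumps(tick))
    if r.llen(JOURNAL_KEY) >= THRESHOLD:
        flush_to_arctic()

def flush_to_arctic():
    rows = [json.loads(x) for x in r.lrange(JOURNAL_KEY, 0, -1)]
    df = pd.DataFrame(rows)
    df["index"] = pd.to_datetime(df["index"], utc=True)
    df = df.set_index("index").sort_index()
    lib_tick_dwx.write("EURUSD", df)   # one chunked write into Arctic
    r.delete(JOURNAL_KEY)              # the journal is now safely stored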

PS: another important advantage is that the size of the DB has decreased from 1060 MB to 73 MB, insane...

LZ compression works, approximately, by finding backreferences to content it has already compressed and emitting a reference to the previous content instead of repeating it. Arctic compresses all the rows in a column together. When you only write a single row at a time you are compressing a single value at a time (one row, one column), so there is little context in which to find repeated content.
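
You can see the effect with any LZ-family compressor. A tiny illustration with zlib from the standard library (Arctic itself uses LZ4, but the backreference idea is the same):

import struct
import zlib

# 100,000 float64 prices that cycle through a handful of values
prices = [1.1234 + (i % 10) * 1e-4 for i in range(100_000)]

# compress the whole column at once vs. one value at a time
one_block = len(zlib.compress(struct.pack("<100000d", *prices)))
per_value = sum(len(zlib.compress(struct.pack("<d", p))) for p in prices)

print(one_block, per_value)
# the single block is a small fraction of the per-value total, because the
# compressor keeps hitting byte patterns it has already seen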

vargaspazdaniel commented 2 years ago

Thanks a lot for your advice!