man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

Read Performance on tickstore #618

Closed sfkiwi closed 5 years ago

sfkiwi commented 6 years ago

Arctic Version

1.68

Arctic Store

TickStore

Description of problem and/or code sample that reproduces the issue

I am using the tick store and have around 1.9M rows in the library. When I do a read with a date range it takes about 12.4 seconds to return the range. There are 16 columns and it's returning 37396 rows.

Having watched the 'All your base' video presentation from James Blackburn, I had the impression that reads would be much faster, as he talks about pulling millions of rows/s.

I haven't done any optimizations yet, so this is just vanilla Arctic TickStore. I'm running a single MongoDB instance, no sharding, on a MacBook Pro, and I'm concurrently writing about 70-100 rows/s while doing the read. CPU is below 10%.

Is this the performance I should expect? What would you consider a best-practice setup for streaming potentially thousands of ticks/s to Arctic?

from arctic import Arctic
from dateutil.parser import parse

db = Arctic('localhost')  # assumes the single local MongoDB instance described above
db.initialize_library('gdax', 'TickStoreV3')
lib = db['gdax']

# Each tick comes in individually and is written directly, with no batching.
# data is a dict with between 6 and 10 key/value pairs.
data['index'] = parse(data['time'])  # parse the ISO time string using dateutil
lib.write(data['product_id'], [data], metadata={'lib': 'cbpro', 'source': 'gdax', 'type': data['type']})
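
For reference, the read being timed looks roughly like this (a sketch; the symbol name and date bounds are illustrative, not the exact values used):

import datetime
from arctic.date import DateRange

start = datetime.datetime(2018, 6, 1)
end = datetime.datetime(2018, 6, 2)
# TickStore.read returns a DataFrame with the library's columns for the requested range
df = lib.read('BTC-USD', date_range=DateRange(start, end))
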
bmoscon commented 6 years ago

I'll let @jamesblackburn or @dimosped comment further, but I believe there are a few things:

  1. With tickstore you have to intelligently decide what a 'good' window size is for ticks. If you choose poorly, your performance will suffer substantially.

  2. When you write directly with no batching in TickStore, each write creates a document (I could be a little off here since TickStore is my weakest of the three engines). This means that when you read, you'll need to read out thousands of documents even for a short window (like a day). That's very slow due to the per-document overhead. You also get a lot of speed-up from compression, which you won't get from single-second/sub-second ticks (see the sketch after this list).

  3. Some speed will definitely come from database tuning.
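
For illustration, batching ticks in memory and writing them in one call (rather than one document per tick) could look like the sketch below; the batch size, callback name, and single-product-per-buffer assumption are only for the example:

from dateutil.parser import parse

BATCH_SIZE = 1000   # arbitrary; tune to the feed's tick rate
buffer = []

def on_tick(data):
    # Accumulate parsed ticks instead of writing one document per tick.
    data['index'] = parse(data['time'])
    buffer.append(data)
    if len(buffer) >= BATCH_SIZE:
        # One write stores the whole batch as a compressed bucket
        # rather than BATCH_SIZE separate documents.
        lib.write(data['product_id'], list(buffer))
        del buffer[:]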

sfkiwi commented 6 years ago

Thanks for the feedback. I will look at batching, perhaps in hourly groupings, with a Redis store. Out of curiosity, has there been any research into using Cassandra as the underlying db (or at least as an alternate option)? It would make it very easy to scale horizontally, and even as a vanilla db it's a pretty good tick store if you set up the schema right from the beginning. Configured for AP (out of CAP), you may get better read performance for historical data because you can read from a node that is not currently being written to. And for real-time consistency you could pull from an upstream Redis, which would always have the latest data. I'm definitely no db expert like all of you though :)
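
A rough sketch of the hourly-batching idea, assuming a local Redis instance and the redis-py client (the key, symbol, and function names are illustrative):

import json
import redis
from dateutil.parser import parse

r = redis.Redis()           # assumes a local Redis instance
KEY = 'ticks:BTC-USD'       # illustrative key/symbol

def buffer_tick(data):
    # Queue each incoming tick; nothing is written to Arctic yet.
    r.rpush(KEY, json.dumps(data))

def flush_to_arctic(lib, symbol='BTC-USD'):
    # Called e.g. once an hour: drain the queue atomically, then write one batch.
    pipe = r.pipeline()
    pipe.lrange(KEY, 0, -1)
    pipe.delete(KEY)
    raw, _ = pipe.execute()
    ticks = [json.loads(t) for t in raw]
    for t in ticks:
        t['index'] = parse(t['time'])   # TickStore needs a datetime 'index'
    if ticks:
        lib.write(symbol, ticks)        # ticks must already be in time order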

bmoscon commented 6 years ago

I'm far from a database expert :) I think @jamesblackburn may have looked into or evaluated Cassandra once upon a time?

krywen commented 6 years ago

@bmoscon Is choosing the TickStore chunk size a matter of 'larger sizes are better because they compress more' vs 'smaller sizes are better if you only want to access a small portion of the data'?

bmoscon commented 6 years ago

A big part of it is that if you ever need to retrieve a lot of chunks of data, it's going to take a lot longer. If your chunk size is larger, retrieving years of data is very performant. If it's very small, there is a lot of overhead involved and it's not really worth it.
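
As a back-of-envelope illustration of the overhead argument (the tick rate is made up, not a measurement):

import math

TICKS_PER_DAY = 100000   # hypothetical feed rate
DAYS = 365

for chunk_size in (100, 10000, 100000):
    # Each bucket is one MongoDB document; fewer, larger documents mean
    # less per-document overhead and better compression.
    docs = DAYS * int(math.ceil(TICKS_PER_DAY / float(chunk_size)))
    print('chunk size %6d -> ~%d documents for a year of data' % (chunk_size, docs))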