man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

VersionStore appends very slow in symbol with many rows #625

Open bmoscon opened 6 years ago

bmoscon commented 6 years ago

Symbol consists of a datetime-indexed dataframe (single index column, date). There are 7 other data columns:

1 of type int64, 1 of type float64, and 5 of type object/string.

Once there were more than 50M-60M rows, appends started taking very large amounts of time - 10-20 minutes each. The appends were on the order of 90k-150k rows.

I've attached a code profiler output that shows where the bottlenecks are. Some are in Arctic, some are on the network side. The machines are connected on GigE links, so the fact that so much time was spent on network traffic leads me to believe large parts of the data are being read back and forth (not sure why that would be). profile.txt
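For anyone who wants to capture a similar profile, here is a minimal stdlib sketch using cProfile. The `append_workload` function is a stand-in, not Arctic code; in practice you would replace its body with the actual `lib.append('test-data', df)` call.

```python
import cProfile
import io
import pstats

def append_workload():
    # Stand-in for the real call, e.g. lib.append('test-data', df).
    # Swap in an actual Arctic append to profile it.
    return sum(i * i for i in range(100_000))

# Profile a single append and report the hottest functions
profiler = cProfile.Profile()
profiler.enable()
append_workload()
profiler.disable()

# Sort by cumulative time so network/serialization hotspots surface first
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(10)
print(buf.getvalue())
```

Comparing the cumulative-time ranking across small and large symbols should show which calls grow with symbol size rather than with append size.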

bmoscon commented 6 years ago

I'll try to generate some test code that reproduces the issue.

bmoscon commented 6 years ago

import random
import time
import uuid
import sys

from arctic import Arctic
import pandas as pd

def gen_data_no_str(start: str, stop: str):
    # Yield one dataframe per 6-hour interval; numeric columns only
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right

        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'venue': [146342] * length,
                           'symbol': [7777] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [random.randint(1, 10000000000000) for _ in range(length)],
                           'side': [random.randint(0, 1) for _ in range(length)],
        }, index=pd.DatetimeIndex(index, name='date'))

        yield df

def gen_data(start: str, stop: str):
    # Yield one dataframe per 6-hour interval, including object/string columns
    def side():
        if random.randint(1, 2) == 1:
            return 'BUY'
        return 'SELL'
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right

        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'timestamp': [str(i) for i in index],
                           'venue': ['NYSE'] * length,
                           'symbol': ['QQQ'] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [str(uuid.uuid4()) for _ in range(length)],
                           'side': [side() for _ in range(length)],
        }, index=pd.DatetimeIndex(index, name='date'))

        yield df

def main(f):
    random.seed(time.time())
    a = Arctic('127.0.0.1')
    if 'repro' in a.list_libraries():
        a.delete_library('repro')
    a.initialize_library('repro')

    lib = a['repro']
    size = 0
    for data in f('2018-01-01', '2018-09-01'):
        start = time.time()
        lib.append('test-data', data)
        end = time.time()
        size += len(data)
        print("Wrote dataframe of len {}. Took {} seconds".format(len(data), end-start))
        print("Total size: {}".format(size))

if __name__ == '__main__':
    func = gen_data_no_str
    if len(sys.argv) == 2 and sys.argv[1] == 'str':
        func = gen_data

    main(func)

You can run the code as-is and it will generate non-string dataframes. The time still grows linearly for each append, but more slowly than it does for the dataframes with strings. You can enable the string case by adding str to the command line after the name of the Python script.

bmoscon commented 6 years ago

@dimosped @jamesblackburn

bmoscon commented 6 years ago

For dataframes with no strings, it's taking about 3 to 3.2 seconds per append by the time it gets to 17M rows (from a start of about 0.2 seconds).

For dataframes with strings it starts off similarly (0.5 seconds per append at the start) and reaches 3+ seconds per append by about 5M rows. By the time it hits 17M rows it's taking about 10 seconds.

bmoscon commented 6 years ago

Also, not sure what happened here, but there's a major spike in append times!

Note: total size is the total number of rows in the symbol.

Wrote dataframe of len 137756. Took 35.16470694541931 seconds
Total size: 45751841
Wrote dataframe of len 90127. Took 41.80634117126465 seconds
Total size: 45841968
Wrote dataframe of len 96921. Took 55.48118996620178 seconds
Total size: 45938889
Wrote dataframe of len 97367. Took 77.72691798210144 seconds
Total size: 46036256
Wrote dataframe of len 146128. Took 131.6923909187317 seconds
Total size: 46182384
Wrote dataframe of len 145066. Took 169.6080777645111 seconds
Total size: 46327450
Wrote dataframe of len 131074. Took 227.51464676856995 seconds
Total size: 46458524
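To put numbers on the spike, here is a quick sketch that computes the per-row cost from the timings in the log above (values copied from the output, rounded to two decimals):

```python
# (rows appended, seconds taken) copied from the log above
samples = [
    (137756, 35.16), (90127, 41.81), (96921, 55.48),
    (97367, 77.73), (146128, 131.69), (145066, 169.61),
    (131074, 227.51),
]

# Normalize each append by its size to separate append size from symbol size
per_row_ms = [secs / rows * 1e3 for rows, secs in samples]
for (rows, secs), ms in zip(samples, per_row_ms):
    print(f"{rows:>7} rows in {secs:7.2f}s -> {ms:.3f} ms/row")
```

The per-row cost grows roughly 7x across these seven consecutive appends, so the spike is not explained by append size; the cost appears to scale with the symbol itself.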
bmoscon commented 5 years ago

@dimosped I know we discussed this a while back - any chance we can try to narrow this down and fix it?