bmoscon opened 6 years ago
I'll try to generate some test code that reproduces the issue:
import random
import time
import uuid
import sys

from arctic import Arctic
import pandas as pd


def gen_data_no_str(start: str, stop: str):
    # Yields one dataframe per 6-hour interval; all columns are numeric.
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right
        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'venue': [146342] * length,
                           'symbol': [7777] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [random.randint(1, 10000000000000) for _ in range(length)],
                           'side': [random.randint(0, 1) for _ in range(length)],
                           }, index=pd.DatetimeIndex(index, name='date'))
        yield df


def gen_data(start: str, stop: str):
    # Same as above, but several columns are strings (object dtype).
    def side():
        if random.randint(1, 2) == 1:
            return 'BUY'
        return 'SELL'

    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right
        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'timestamp': [str(ts) for ts in index],
                           'venue': ['NYSE'] * length,
                           'symbol': ['QQQ'] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [str(uuid.uuid4()) for _ in range(length)],
                           'side': [side() for _ in range(length)],
                           }, index=pd.DatetimeIndex(index, name='date'))
        yield df


def main(f):
    random.seed(time.time())
    a = Arctic('127.0.0.1')
    if 'repro' in a.list_libraries():
        a.delete_library('repro')
    a.initialize_library('repro')
    lib = a['repro']

    size = 0
    for data in f('2018-01-01', '2018-09-01'):
        start = time.time()
        lib.append('test-data', data)
        end = time.time()
        size += len(data)
        print("Wrote dataframe of len {}. Took {} seconds".format(len(data), end - start))
        print("Total size: {}".format(size))


if __name__ == '__main__':
    func = gen_data_no_str
    if len(sys.argv) == 2 and sys.argv[1] == 'str':
        func = gen_data
    main(func)
You can run the script as-is and it will generate the non-string dataframes; the append time still grows linearly, but more slowly than it does for the dataframes with strings. To exercise the string variant, pass str as the only command-line argument after the script name.
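For reference, a quick way to see what one generated chunk looks like (this only calls the generator functions above, so it needs pandas but no running MongoDB):

# Pull a single ~6-hour chunk from each generator and compare dtypes.
chunk = next(gen_data_no_str('2018-01-01', '2018-01-02'))
print(len(chunk))       # roughly 90k-150k rows
print(chunk.dtypes)     # all numeric columns (int64/float64)

chunk = next(gen_data('2018-01-01', '2018-01-02'))
print(chunk.dtypes)     # timestamp/venue/symbol/id/side are object (string) columns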
@dimosped @jamesblackburn
For dataframes with no strings, it's taking about 3 to 3.2 seconds per append by the time the symbol reaches 17M rows (from a start of about 0.2 seconds).
For dataframes with strings, it starts off similarly (about 0.5 seconds per append) but reaches 3+ seconds per append by around 5M rows. By the time it hits 17M rows, each append takes about 10 seconds.
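To make the growth easier to plot, the main() loop in the repro script can be extended to log cumulative rows against per-append time. A minimal sketch (the main_with_log name and CSV filename are just illustrative; it reuses the generators and library setup from the script above):

import csv

def main_with_log(f, log_path='append_timings.csv'):
    # Same setup as main() above, but also records (total_rows, append_seconds)
    # so the growth curve can be plotted afterwards.
    random.seed(time.time())
    a = Arctic('127.0.0.1')
    if 'repro' in a.list_libraries():
        a.delete_library('repro')
    a.initialize_library('repro')
    lib = a['repro']

    size = 0
    with open(log_path, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['total_rows', 'append_seconds'])
        for data in f('2018-01-01', '2018-09-01'):
            start = time.time()
            lib.append('test-data', data)
            elapsed = time.time() - start
            size += len(data)
            writer.writerow([size, elapsed])
            print("Wrote dataframe of len {}. Took {} seconds".format(len(data), elapsed))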
Also, I'm not sure what happened here, but there was a major spike in append times. Note: "Total size" is the total number of rows in the symbol.
Wrote dataframe of len 137756. Took 35.16470694541931 seconds
Total size: 45751841
Wrote dataframe of len 90127. Took 41.80634117126465 seconds
Total size: 45841968
Wrote dataframe of len 96921. Took 55.48118996620178 seconds
Total size: 45938889
Wrote dataframe of len 97367. Took 77.72691798210144 seconds
Total size: 46036256
Wrote dataframe of len 146128. Took 131.6923909187317 seconds
Total size: 46182384
Wrote dataframe of len 145066. Took 169.6080777645111 seconds
Total size: 46327450
Wrote dataframe of len 131074. Took 227.51464676856995 seconds
Total size: 46458524
@dimosped I know we discussed this a while back - any chance we can try to narrow this down and fix it?
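One cheap way to start narrowing it down might be to watch how the stored representation of the symbol grows between appends. A sketch (get_info and list_versions are existing VersionStore calls, but the exact contents of the returned dicts vary between arctic versions, so treat the output as indicative):

from arctic import Arctic

a = Arctic('127.0.0.1')
lib = a['repro']

# Handler-specific summary of how the symbol is stored
# (row count, dtype, segment/chunk counts, etc.).
print(lib.get_info('test-data'))

# Each append creates a new version document; inspecting them between
# appends can show whether existing data is being rewritten.
for v in lib.list_versions('test-data'):
    print(v)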
The symbol consists of a datetime-indexed dataframe (a single index column, date). There are 7 other data columns:
1 of type int64, 1 of type float64, and 5 of type object/string.
Once there were more than 50M-60M rows, the appends started taking a very long time - 10-20 minutes - even though each append was only on the order of 90k-150k rows.
I've attached code profiler output showing where the bottlenecks are. Some are in arctic, some are on the network side. The machines are connected over gigabit links, so the fact that so much time is spent on network traffic leads me to believe that large parts of the existing data are being read back and forth (I'm not sure why that would be). profile.txt
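For anyone reproducing this, a single slow append can also be profiled in isolation with the standard library profiler. A sketch (it assumes the gen_data generator and the repro library from the script above; I'm not implying this is how the attached profile.txt was produced):

import cProfile
import pstats

from arctic import Arctic

a = Arctic('127.0.0.1')
lib = a['repro']

# Profile one append against an already-large symbol.
df = next(gen_data('2018-09-01', '2018-09-02'))
profiler = cProfile.Profile()
profiler.enable()
lib.append('test-data', df)
profiler.disable()
profiler.dump_stats('append.prof')

# Show the 30 most expensive calls by cumulative time.
pstats.Stats('append.prof').sort_stats('cumulative').print_stats(30)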