man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

Experiencing extremely slow reads on TickStore -- fully executable example included. #895

Status: Open · opened by jeffneuen 3 years ago

jeffneuen commented 3 years ago

Arctic Version

arctic==1.79.4
pandas==1.1.5

Arctic Store

TickStore

Platform and version

Ubuntu Linux 20.04, Python 3.8.8 (Anaconda), running JupyterLab. Modern CPU with NVMe storage.

Description of problem and/or code sample that reproduces the issue

I am experiencing very slow TickStore reads. In my sample code below, writing 5 million rows takes 1.2 s, which seems good. However, reading the same data back takes 59 s.

import arctic
import pandas as pd

arctic_host  = 'localhost:27017'
test_library_name = 'dev_speed_testing_library'
test_store = arctic.Arctic(arctic_host)
test_store.delete_library(test_library_name)
test_store.initialize_library(test_library_name, 'TickStoreV3')

test_library = test_store[test_library_name]
test_library._chunk_size = 1000000
test_library.list_symbols()

[]

from numpy.random import default_rng
data_length = 5000000
sample_index = pd.date_range(start='1990-01-01', periods=data_length, freq='1ms', tz='UTC')
rng = default_rng()
sample_data = rng.standard_normal(data_length)
sample_data = sample_data * sample_data  # square to make all values non-negative
test_df = pd.DataFrame(sample_data, sample_index, columns=['price'])
test_df.dtypes

price    float64
dtype: object

%%time  # Jupyter cell magic; remove this line if you're not running in Jupyter
test_library.write('testsymbol', test_df)

CPU times: user 1.16 s, sys: 64.1 ms, total: 1.22 s Wall time: 1.26 s

%%time  # Jupyter cell magic; remove this line if you're not running in Jupyter
tmp = test_library.read('testsymbol')

CPU times: user 59.3 s, sys: 3.14 s, total: 1min 2s Wall time: 59.7 s

On the read operation, the process appears to be CPU bound, with a single Python thread pegged at 100%.

I'm not sure if I'm missing something obvious here, like using the wrong data types, but reads being this many multiples slower than writes seems odd.
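To separate the conversion cost from MongoDB and Arctic entirely, the slow step can be reproduced with a synthetic array of epoch-millisecond integers, sized like the example above. This is a standalone sketch, not Arctic code; that TickStore's raw on-disk index is exactly int64 epoch-ms is an assumption here.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for TickStore's raw index: 5M epoch-ms integers
ms = np.arange(0, 5_000_000, dtype=np.int64)

start = time.perf_counter()
# The same call shape as TickStore's read path: unit='ms' conversion
idx = pd.to_datetime(ms, utc=True, unit='ms')
elapsed = time.perf_counter() - start

print(f"converted {len(idx)} timestamps in {elapsed:.2f}s")
```

On the pandas versions discussed below, the timing of this one call should roughly track the gap between the fast and slow reads seen in the issue.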

jeffneuen commented 3 years ago

I also ran a profiler on a read from tickstore, and below are a few relevant lines... Is there a different type that I should be saving the datetime index in?

ncalls tottime percall cumtime percall filename:lineno(function)

    1    0.585    0.585  135.354  135.354 <string>:1(<module>)
    1    0.000    0.000  122.954  122.954 datetimes.py:259(_convert_listlike_datetimes)
    1    0.000    0.000  122.955  122.955 datetimes.py:605(to_datetime)
    1    0.205    0.205  134.769  134.769 tickstore.py:265(read)
    1    0.000    0.000  135.354  135.354 {built-in method builtins.exec}
    1  122.954  122.954  122.954  122.954 {pandas._libs.tslib.array_with_unit_to_datetime}
jeffneuen commented 3 years ago

Is there any other information I could provide that would help to better describe this issue?

It is still a problem for me. Thanks!

jeffneuen commented 3 years ago

I did more testing on this, and the problem disappears when running pandas 0.25.3. It also disappears if you change the freq of the sample index generator to '1ns' (although the datetime values returned by TickStore will be incorrect if you feed it nanosecond-level data via the write method).

It looks like pandas is handling the to_datetime call on line 388 of tickstore.py:

        index = pd.to_datetime(np.concatenate(rtn[INDEX]), utc=True, unit='ms')

differently enough in v0.25.3 vs the 1.x branch that the new version of pandas spends a lot of time in pandas._libs.tslib.array_with_unit_to_datetime, as shown by the profiler, where the old version does not.
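The key observation is that pandas only takes the slow array_with_unit_to_datetime path when it has to interpret raw integers via unit='ms'; if numpy does the unit conversion first via astype, pd.to_datetime receives datetime64 values and skips that code path. A small sketch (assuming the stored index is epoch-millisecond integers, as unit='ms' implies) checking that both routes give the same timestamps:

```python
import numpy as np
import pandas as pd

# Hypothetical epoch-ms values, standing in for the concatenated rtn[INDEX]
raw = np.array([0, 1, 2, 86_400_000], dtype=np.int64)

# Slow path on pandas 1.x: integer input forces array_with_unit_to_datetime
via_unit = pd.to_datetime(raw, utc=True, unit='ms')

# Faster path: numpy converts int64 -> datetime64[ms], pandas only localizes
via_astype = pd.to_datetime(raw.astype('datetime64[ms]'), utc=True)

# Both routes must agree element-wise
assert (via_unit == via_astype).all()
```

This is why the workarounds later in the thread all insert an astype('datetime64[ms]') before (or instead of) the pd.to_datetime call.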

crazy25000 commented 3 years ago

I just installed and set up Arctic and noticed the slow read performance with TickStore + pandas 1.2.3. However, reads are really fast with VersionStore, and I thought it would be the opposite :sweat_smile:

jeffneuen commented 3 years ago

Yes, this makes me wonder whether Man Financial (the creator and maintainer of this package) is really still using the old pandas 0.25 branch internally, or whether nobody there is using TickStore; otherwise someone else would surely have discovered this.

crazy25000 commented 3 years ago

It's unfortunate; it would've been nice to use. I also tested https://github.com/alpacahq/marketstore and it performs well. I'd recommend trying it.

I've been testing different libraries to decide which one to keep using, and whether, if a library is outdated and unmaintained, to update it myself. Have you tried other libraries?

jeffneuen commented 3 years ago

@crazy25000 thanks for the tip! Happy to continue this discussion, but I don't want to clutter up the github issue with it. My email is on my profile if you'd like to chat datastores further!

jeffneuen commented 3 years ago

This is still an outstanding issue for me, if there is anything else I can provide to help clarify this issue, please let me know.

bmoscon commented 3 years ago

I'd bet that if you downgrade pandas it will work better. This library isn't extensively used or tested on very recent pandas releases, and there have been cases in the past where behavior changed in pandas (for the worse) and made trivial operations in Arctic take incredibly long (e.g. 5 ms became 30 seconds).

jeffneuen commented 3 years ago

@bmoscon you are correct: if I use the pandas 0.25 branch, the problem is solved. However, this creates pretty serious workflow issues. If a user wants to pull data with 0.25 but then work with it using a current 1.x version of pandas, they need two different venvs and end up storing the data in some kind of intermediate layer, unless I'm missing an obvious and simpler workaround.

It's possible, but it seems to defeat a lot of the benefit of Arctic if I'm dumping everything into Parquet files with code running 0.25 and then loading those files with 1.x pandas to do the work. The most recent 0.25 pandas release is from October 2019 -- a little stale at this point.

But thank you for the reply; point taken that perhaps Arctic just doesn't have complete support for 1.x pandas yet.

vargaspazdaniel commented 3 years ago

I'm having issues with TickStore read speed too. Reading only around 205k rows takes around 1 min, while writing the data works perfectly. Is there any way to read tick data (which usually has thousands and thousands of rows) faster? Maybe using dask, modin, or another higher-performance pandas alternative.

JunyueLiu commented 3 years ago

My solution is to replace line 338 with:

        index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True, unit='ms')

jeffneuen commented 3 years ago

@JunyueLiu I tweaked your suggestion just a little bit to:

index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True)

and now the reads are back to a normal speed, about 6.5M rows/sec.

@JunyueLiu would you like to submit a PR since the fix was basically yours? If not I'll do it and credit you. I would think that this issue must be affecting a lot of people who would benefit from the fix being in a release.

JunyueLiu commented 3 years ago

Feel free to submit the PR.

CmpCtrl commented 3 years ago

I found this issue as well, and it can be quicker still by omitting pd.to_datetime altogether:

        index = np.concatenate(rtn[INDEX]).astype("datetime64[ms]")

However, this also requires a second change where the timezone is converted, on line 359. The following worked for me:

        rtn.index = rtn.index.tz_localize(dt.now().astimezone().tzinfo)
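This variant works because a datetime64[ms] array can be wrapped in a DataFrame index as-is, and the timezone attached in one vectorized tz_localize call at the end. A minimal self-contained sketch of the idea (sample values are hypothetical, and UTC is used here instead of the local timezone so the result is deterministic):

```python
from datetime import timezone

import numpy as np
import pandas as pd

# Hypothetical epoch-ms index values, standing in for rtn[INDEX]
raw = np.array([0, 1_000, 2_000], dtype=np.int64)

# numpy performs the ms -> datetime conversion; no pd.to_datetime needed
index = np.concatenate([raw]).astype("datetime64[ms]")

df = pd.DataFrame({'price': [1.0, 2.0, 3.0]}, index=pd.DatetimeIndex(index))

# Second step: attach the timezone afterwards (CmpCtrl's change used the
# local timezone; UTC here keeps the sketch reproducible)
df.index = df.index.tz_localize(timezone.utc)
```

Note that localizing to the machine's local timezone, as in the snippet above, bakes the reading host's timezone into the result, which may matter if readers run in different regions.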

jeffneuen commented 3 years ago

@CmpCtrl I tried your solution and did indeed get about 50% faster reads, about 9.5M rows/sec. Thanks!

CmpCtrl commented 3 years ago

I started a branch to work on a couple of other things as well. mktz() seems really slow: my first call to get the max or min date from a symbol took ~0.6 seconds, and most of that seemed to be spent finding the local timezone. I also brought in the fixes from #887 so I could get back to the latest Python and pandas versions. I haven't done much testing, and I'm only using a small portion of the functionality, so I'm not sure how relevant these changes are to others.
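Since the expensive part is resolving the local timezone rather than using it, one mitigation is to look it up once and cache the result. A stdlib-only sketch (the function name local_tzinfo is hypothetical, not Arctic's mktz API):

```python
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=1)
def local_tzinfo():
    """Resolve the local timezone once; repeated calls return the cached tzinfo."""
    # datetime.now().astimezone() queries the system timezone, which is
    # the relatively slow step being amortized here
    return datetime.now().astimezone().tzinfo


tz = local_tzinfo()  # first call pays the lookup cost
tz_again = local_tzinfo()  # subsequent calls are effectively free
```

A cache like this assumes the process's timezone doesn't change at runtime, which is usually safe for a datastore client.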

jeffneuen commented 3 years ago

Thanks, I am checking out your branch; those functions are useful to me. I need to be on the 1.x version of pandas for other reasons, and min and max date are also useful. I hope that at some point this project will standardize on more recent versions of Python and pandas, but my feeling is that the main corporate owner of the project probably has their own internal versions, and that's what it's being maintained for.

burrowsa commented 2 years ago

I'm also seeing this problem. Picking up the fix from @jeffneuen's repo fixed it for me.