jeffneuen opened this issue 3 years ago
I also ran a profiler on a read from tickstore, and below are a few relevant lines... Is there a different type that I should be saving the datetime index in?
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.585 0.585 135.354 135.354 <string>:1(<module>)
1 0.000 0.000 122.954 122.954 datetimes.py:259(_convert_listlike_datetimes)
1 0.000 0.000 122.955 122.955 datetimes.py:605(to_datetime)
1 0.205 0.205 134.769 134.769 tickstore.py:265(read)
1 0.000 0.000 135.354 135.354 {built-in method builtins.exec}
1 122.954 122.954 122.954 122.954 {pandas._libs.tslib.array_with_unit_to_datetime}
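For reference, here is a minimal sketch of how a profile like this can be captured; the connection string, library name, and symbol are placeholders rather than my actual setup.

import cProfile
import pstats

from arctic import Arctic

store = Arctic('localhost')            # assumes a local MongoDB instance
tick_lib = store['sample.ticks']       # placeholder TickStore library name

profiler = cProfile.Profile()
profiler.enable()
df = tick_lib.read('SYMBOL')           # the slow read being profiled
profiler.disable()

# show the calls dominating the runtime (to_datetime / array_with_unit_to_datetime)
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)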
Is there any other information I could provide that would help to better describe this issue?
It is still a problem for me. Thanks!
I did more testing on this, and the problem disappears when running pandas 0.25.3. It also disappears if you change the freq of the sample index generator to '1ns' (although the datetime values returned by TickStore will be incorrect if you feed it nanosecond-level data via the write method).
It looks like pandas is handling the to_datetime call on line 388 of tickstore.py:
index = pd.to_datetime(np.concatenate(rtn[INDEX]), utc=True, unit='ms')
differently enough in v0.25.3 vs the 1.x branch that the new version of pandas is spending a lot of time in pandas._libs.tslib.array_with_unit_to_datetime, as shown by the profiler, where the old version is not.
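A rough, self-contained way to see the difference (a sketch, not code from tickstore itself); the exact timings will depend on the pandas version and on the dtype of the raw index array, which I haven't verified here:

import time

import numpy as np
import pandas as pd

ms = np.arange(5_000_000, dtype=np.int64) + 1_600_000_000_000   # fake epoch-millisecond values

t0 = time.perf_counter()
slow = pd.to_datetime(ms, utc=True, unit='ms')                   # path that hits array_with_unit_to_datetime
t1 = time.perf_counter()
fast = pd.to_datetime(ms.astype('datetime64[ms]'), utc=True)     # plain datetime64 path
t2 = time.perf_counter()

assert (slow == fast).all()
print(f"unit='ms' path: {t1 - t0:.2f}s  astype path: {t2 - t1:.2f}s")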
I just installed and set up Arctic and noticed the slow read performance with TickStore + pandas 1.2.3. However, reads are really fast with VersionStore, and I thought it would be the opposite :sweat_smile:
Yes, this makes me wonder if Man Group (the creator and maintainer of this package) is really still using the old pandas 0.25 branch internally, or if nobody there is using TickStore; otherwise someone else would certainly have discovered this.
It's unfortunate; it would've been nice to use. I also tested https://github.com/alpacahq/marketstore and it performs well. Would recommend trying it.
I've been testing different libraries to determine which one to keep and continue using, and, if it's outdated and unmaintained, whether to update it myself. Have you tried other libraries?
@crazy25000 thanks for the tip! Happy to continue this discussion, but I don't want to clutter up the github issue with it. My email is on my profile if you'd like to chat datastores further!
This is still an outstanding issue for me. If there is anything else I can provide to help clarify it, please let me know.
I'd bet that if you downgrade pandas it will work better. This library isn't extensively used or tested on very recent pandas releases, and there have been cases in the past where behavior changed in pandas (for the worse) and made trivial operations in arctic take incredibly long (e.g. 5 ms to 30 seconds).
@bmoscon you are correct: if I use the pandas 0.25 branch, the problem is solved. However, this creates pretty serious workflow issues. If a user wants to pull data with 0.25 but then work with the data using a current 1.x version of pandas, you need two different venvs and end up storing the data in some kind of intermediate layer, unless I'm missing an obvious and simpler workaround.
It's possible, but it seems to defeat a lot of the benefit of arctic if I am dumping everything into parquet files with code running 0.25 and then loading the parquet files with 1.x pandas to do the work. The most recent version of 0.25 pandas is from Oct 2019 -- a little stale at this point.
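For what it's worth, the round trip I'm describing looks roughly like this (paths, library and symbol names are placeholders, and it assumes pyarrow or fastparquet is installed in both environments):

# --- in the pandas 0.25 + arctic venv ---
from arctic import Arctic

store = Arctic('localhost')
df = store['sample.ticks'].read('SYMBOL')      # placeholder library/symbol
df.to_parquet('/tmp/SYMBOL.parquet')

# --- in the pandas 1.x venv ---
import pandas as pd

df = pd.read_parquet('/tmp/SYMBOL.parquet')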
But, thank you for the reply, the point is taken that perhaps arctic just doesn't have complete support for 1.x pandas yet.
I'm having issues with the read speed with TickStore too. Only around 205k rows takes around 1 min, while writing the data works perfectly and without issues. Is there any way to read tick data (which usually has thousands and thousands of rows) faster? Maybe using dask, modin, or another pandas version with higher speed.
My solution is to replace line 338 with:
index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True, unit='ms')
@JunyueLiu I tweaked your suggestion just a little bit to:
index = pd.to_datetime(np.concatenate(rtn[INDEX]).astype('datetime64[ms]'), utc=True)
and now the reads are back to a normal speed, about 6.5M rows/sec.
@JunyueLiu would you like to submit a PR since the fix was basically yours? If not I'll do it and credit you. I would think that this issue must be affecting a lot of people who would benefit from the fix being in a release.
Feel free to submit the PR.
I found this issue as well, and found that it can be quicker still by omitting the pd.to_datetime call altogether.
index = (np.concatenate(rtn[INDEX])).astype("datetime64[ms]")
However, this solution also requires a second change to where the timezone is converted, line 359. The following worked for me.
rtn.index = rtn.index.tz_localize(dt.now().astimezone().tzinfo)
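For anyone wanting to try this outside of tickstore.py first, here is a small self-contained sketch of the combined approach (the sample values and column name are made up):

from datetime import datetime

import numpy as np
import pandas as pd

chunks = [np.array([1_600_000_000_000, 1_600_000_000_500], dtype=np.int64)]  # stand-in for rtn[INDEX]

index = np.concatenate(chunks).astype('datetime64[ms]')   # naive datetime64, no pd.to_datetime
df = pd.DataFrame({'price': [1.0, 2.0]}, index=pd.DatetimeIndex(index))

# second change: localize the naive index to the machine's timezone
df.index = df.index.tz_localize(datetime.now().astimezone().tzinfo)
print(df.index)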
@CmpCtrl I tried your solution, and did indeed get about 50% faster reads, about 9.5M rows/sec. Thanks!
I started a branch to work on a couple of other things as well (see my branch). The mktz() call seems really slow; my first call to get the max or min date from a symbol took ~0.6 seconds, and it seemed like most of that was spent finding the local timezone. I also brought in the fixes from #887 so I could get back to the latest Python and pandas versions. I haven't done much testing and I am only using a small portion of the functionality, so I'm not sure how relevant these changes are to others.
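One cheap mitigation along those lines is simply caching the local-timezone result. This is a sketch assuming the lookup goes through tzlocal, which is what appeared slow for me; it is not necessarily what the branch does:

import functools

import tzlocal

@functools.lru_cache(maxsize=1)
def cached_local_tz():
    # tzlocal.get_localzone() is the expensive part; cache it for the process lifetime
    return tzlocal.get_localzone()

tz = cached_local_tz()   # subsequent calls are effectively free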
Thanks, I am checking out your branch, those functions are useful to me! I need to be on the 1.x version of pandas for other reasons, and min and max date are also useful to me. I hope that at some point this project will be able to standardize on the more recent versions of python and pandas, but my feeling is that the main corporate owner of the project probably has their own internal versions that they use, and that's what it's being maintained for.
I'm also seeing this problem. Picking up the fix from @jeffneuen 's repo fixed it for me.
Arctic Version
Arctic Store
Platform and version
Ubuntu Linux 20.04, Python 3.8.8 (Anaconda), running JupyterLab; modern CPU w/ NVMe
Description of problem and/or code sample that reproduces the issue
I am experiencing very slow Tickstore reads. In my sample code below, the write operation clocks at 1.2s for 5 million rows, which seems good. However, when I read the data, the read operation clocks at 59s.
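Roughly, the sample code looks like the following (the library name, symbol, and index frequency here are placeholders, not my exact values):

import numpy as np
import pandas as pd
from arctic import Arctic, TICK_STORE

store = Arctic('localhost')
store.initialize_library('test.ticks', lib_type=TICK_STORE)   # one-time setup
lib = store['test.ticks']

n = 5_000_000
idx = pd.date_range('2021-01-01', periods=n, freq='1ms', tz='UTC')
df = pd.DataFrame({'price': np.random.random(n)}, index=idx)

lib.write('TEST', df)       # write: ~1.2 s
result = lib.read('TEST')   # read: ~59 s on pandas 1.x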
price    float64
dtype: object
CPU times: user 1.16 s, sys: 64.1 ms, total: 1.22 s
Wall time: 1.26 s
CPU times: user 59.3 s, sys: 3.14 s, total: 1min 2s
Wall time: 59.7 s
On the read operation, the process seems to be cpu bound, with a single python thread pegged at 100%.
Not sure if I'm missing something obvious here, like using the wrong data types, but writes that are that many multiples faster than reads seem odd.