quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

KeyError in ingesting minute-frequency csv data #2366

Open josephcclin opened 5 years ago

josephcclin commented 5 years ago

Dear Zipline Maintainers,

Before I tell you about my issue, let me describe my environment:

Environment

* Operating System: Windows 10
* Python Version: 3.5.5
* Python Bitness: 64
* How did you install Zipline: conda
* Python packages:

    #
    # Name                    Version         Build             Channel
    alembic                   0.7.7           py35_0            Quantopian
    asn1crypto                0.24.0          py35_0
    bcolz                     0.12.1          np114py35_0       Quantopian
    blas                      1.0             mkl
    blosc                     1.14.4          he51fdeb_0
    bottleneck                1.2.1           py35h452e1ab_1
    bzip2                     1.0.6           hfa6e2cd_5
    ca-certificates           2018.03.07      0
    certifi                   2018.8.24       py35_1
    cffi                      1.11.5          py35h74b6da3_1
    chardet                   3.0.4
    chardet                   3.0.4           py35_1
    click                     6.7             py35h10df73f_0
    contextlib2               0.5.5           py35h0a97e54_0
    cryptography              2.3.1           py35h74b6da3_0
    cyordereddict             0.2.2           py35_0            Quantopian
    Cython                    0.29
    cython                    0.28.5          py35h6538335_0
    decorator                 4.3.0           py35_0
    empyrical                 0.5.0           py35_0            Quantopian
    hdf5                      1.10.2          hac2f561_1
    icc_rt                    2017.0.4        h97af966_0
    idna                      2.7             py35_0
    idna                      2.7
    intel-openmp              2019.0          118
    intervaltree              2.1.0           py35_0            Quantopian
    libiconv                  1.15            h1df5818_7
    libxml2                   2.9.8           hadb2253_1
    libxslt                   1.1.32          hf6f1972_0
    Logbook                   1.4.1
    logbook                   0.12.5          py35_0            Quantopian
    lru-dict                  1.1.4           py35_0            Quantopian
    lxml                      4.2.5           py35hef2cd61_0
    lzo                       2.10            h6df0209_2
    mako                      1.0.7           py35_0
    markupsafe                1.0             py35hfa6e2cd_1
    mkl                       2019.0          118
    mkl_fft                   1.0.6           py35hdbbee80_0
    mkl_random                1.0.1           py35h77b88f5_1
    multipledispatch          0.6.0           py35_0
    networkx                  1.11            py35_1
    numexpr                   2.6.1           np114py35_0       Quantopian
    numpy                     1.15.4
    numpy                     1.14.6          py35hc27ee41_4
    numpy-base                1.14.6          py35h8128ebf_4
    openssl                   1.0.2p          hfa6e2cd_0
    pandas                    0.22.0
    pandas                    0.22.0          py35h6538335_0
    pandas-datareader         0.7.0
    pandas-datareader         0.6.0           py35_0
    patsy                     0.5.0           py35_0
    patsy                     0.5.1
    pip                       18.1
    pip                       10.0.1          py35_0
    pycparser                 2.19            py35_0
    pyopenssl                 18.0.0          py35_0
    pysocks                   1.6.8           py35_0
    pytables                  3.4.4           py35he6f6034_0
    python                    3.5.5           h0c2934d_2
    python-dateutil           2.7.3           py35_0
    python-dateutil           2.7.5
    pytz                      2018.5          py35_0
    pytz                      2018.7
    requests                  2.19.1          py35_0
    requests                  2.20.1
    requests-file             1.4.3
    requests-file             1.4.3           py35_0
    requests-ftp              0.3.1           py35_0
    scipy                     1.1.0
    scipy                     1.1.0           py35hc28095f_0
    setuptools                40.2.0          py35_0
    six                       1.11.0          py35_1
    six                       1.11.0
    snappy                    1.1.7           h777316e_3
    sortedcontainers          1.4.4           py35_0            Quantopian
    sqlalchemy                1.2.11          py35hfa6e2cd_0
    statsmodels               0.9.0           py35h452e1ab_0
    statsmodels               0.9.0
    toolz                     0.9.0           py35_0
    trading-calendars         1.0.1           py35_0            Quantopian
    urllib3                   1.24.1
    urllib3                   1.23            py35_0
    vc                        14.1            h0510ff6_4
    vs2015_runtime            14.15.26706     h3a45250_0
    wheel                     0.31.1          py35_0
    win_inet_pton             1.0.1           py35_1
    wincertstore              0.2             py35hfebbdb8_0
    wrapt                     1.10.11
    wrapt                     1.10.11         py35hfa6e2cd_2
    zipline                   1.3.0           np114py35_0       Quantopian
    zlib                      1.2.11          h8395fce_2

Now that you know a little about me, let me tell you about the issue I am having:

Description of Issue

Here is how you can reproduce this issue on your machine:

Reproduction Steps

  1. Download the csv data (https://www.dropbox.com/s/9enxomhizk86dzk/sample.csv?dl=0)
  2. Edit the ..zipline\extension.py as below:

        import pandas as pd

        from zipline.data.bundles import register
        from zipline.data.bundles.csvdir import csvdir_equities

        start_session = pd.Timestamp('2009-08-24', tz='UTC')
        end_session = pd.Timestamp('2010-08-24', tz='UTC')

        register(
            'futures-bundle-min',
            csvdir_equities(
                ['minute'],
                # Raw string, so the backslashes are not read as escape sequences.
                r'C:\Users\user\zipTest',
            ),
            calendar_name='CME',
            start_session=start_session,
            end_session=end_session
        )

  3. Place "sample.csv" at C:\Users\user\zipTest\minute\
  4. Run "zipline ingest -b futures-bundle-min"

The following errors then appeared:

Loading custom pricing data:   [####################################]  100% | sample: sid 0
Merging minute equity files:  [------------------------------------]  0
Traceback (most recent call last):
  File "pandas/_libs/index.pyx", line 449, in pandas._libs.index.DatetimeEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 817, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1282579200000000000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 421, in pandas._libs.index.DatetimeEngine.get_loc
  File "pandas/_libs/index.pyx", line 451, in pandas._libs.index.DatetimeEngine.get_loc
KeyError: Timestamp('2010-08-23 16:00:00+0000', tz='UTC')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pandas/_libs/index.pyx", line 449, in pandas._libs.index.DatetimeEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 817, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1282579200000000000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\Anaconda3\envs\zipTest\Scripts\zipline-script.py", line 11, in <module>
    load_entry_point('zipline==1.3.0', 'console_scripts', 'zipline')()
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\click\core.py", line 697, in main
    rv = self.invoke(ctx)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\click\core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\click\core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\click\core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\zipline\__main__.py", line 348, in ingest
    show_progress,
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\zipline\data\bundles\core.py", line 451, in ingest
    pth.data_path([name, timestr], environ=environ),
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\zipline\data\bundles\csvdir.py", line 94, in ingest
    self.csvdir)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\zipline\data\bundles\csvdir.py", line 156, in csvdir_bundle
    show_progress=show_progress)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\zipline\data\minute_bars.py", line 697, in write
    write_sid(*e, invalid_data_behavior=invalid_data_behavior)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\zipline\data\minute_bars.py", line 730, in write_sid
    self._write_cols(sid, dts, cols, invalid_data_behavior)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\zipline\data\minute_bars.py", line 810, in _write_cols
    latest_min_count = all_minutes.get_loc(last_minute_to_write)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\pandas\core\indexes\datetimes.py", line 1426, in get_loc
    return Index.get_loc(self, key, method, tolerance)
  File "C:\Users\user\Anaconda3\envs\zipTest\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 421, in pandas._libs.index.DatetimeEngine.get_loc
  File "pandas/_libs/index.pyx", line 451, in pandas._libs.index.DatetimeEngine.get_loc
**KeyError: Timestamp('2010-08-23 16:00:00+0000', tz='UTC')**

...

## What steps have you taken to resolve this already?

...

# Anything else?

...

Sincerely,
Joseph

josephcclin commented 5 years ago

After I changed the trading calendar to 'NYSE', the issue was resolved. However, I am still puzzled, because the timestamps of my minute-frequency data all appear to fall within the CME calendar's hours.
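For anyone reading the traceback above: the integer key in the first KeyError and the Timestamp in the last one are the same instant, the integer being nanoseconds since the Unix epoch. A minimal standard-library sketch of that conversion (the helper name `ns_to_utc` is made up for illustration):

```python
from datetime import datetime, timezone

NS_PER_SECOND = 10 ** 9


def ns_to_utc(ns):
    """Convert nanoseconds since the Unix epoch to a timezone-aware UTC datetime."""
    seconds, remainder_ns = divmod(ns, NS_PER_SECOND)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only stores microseconds, so sub-microsecond precision is dropped.
    return moment.replace(microsecond=remainder_ns // 1000)


missing_key = ns_to_utc(1282579200000000000)
print(missing_key.isoformat())  # 2010-08-23T16:00:00+00:00
```

So the lookup that fails is for the minute 2010-08-23 16:00 UTC, which is not present in the minute index built from the chosen calendar.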

robinsonOdhiambo commented 5 years ago

Hi, I got the same issue when ingesting minute data with a 24/7 calendar.

txu2014 commented 5 years ago

Can you try configuring minutes_per_day, which defaults to 390 (for stocks), when registering the bundle? For example:

minutes_per_day=1440,
calendar_name='CME',
start_session=None,
end_session=None
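As a rough check of where those two numbers come from, here is a minimal sketch. It assumes a 09:30 to 16:00 equity session for the 390 figure and a round-the-clock session for 1440; the helper `minutes_in_session` is hypothetical and not part of Zipline.

```python
from datetime import datetime, timedelta


def minutes_in_session(open_time, close_time):
    """Count one-minute bars between an open and a close given as 'HH:MM'.

    A close at or before the open is treated as falling on the next day,
    so '00:00' to '00:00' means a full 24-hour session.
    """
    fmt = "%H:%M"
    start = datetime.strptime(open_time, fmt)
    end = datetime.strptime(close_time, fmt)
    if end <= start:
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)


equity_minutes = minutes_in_session("09:30", "16:00")       # 6.5 hours
always_open_minutes = minutes_in_session("00:00", "00:00")  # 24 hours
print(equity_minutes, always_open_minutes)  # 390 1440
```

If your calendar's sessions are longer than 6.5 hours, the default of 390 is too small for them, which is consistent with the suggestion above.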

netshade commented 4 years ago

I'm not confident this is the best fix but it seemed like it had something to do with the calculation of the last possible index in the range. I fixed it in my installation by changing the following line in:

zipline/data/minute_bars.py from:

latest_min_count = all_minutes.get_loc(last_minute_to_write)

to

latest_min_count = all_minutes.get_loc(last_minute_to_write, 'backfill')

to cause it to find the value at the next available minute after the one it's looking for, if that minute is not found.
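To make the difference between the two lookups concrete, here is a standard-library analogy using `bisect` on a sorted list. It is not the pandas implementation, and the minute values and helper names are invented for illustration.

```python
from bisect import bisect_left
from datetime import datetime


def exact_loc(sorted_minutes, key):
    """Return the position of key, or raise KeyError if it is absent."""
    i = bisect_left(sorted_minutes, key)
    if i < len(sorted_minutes) and sorted_minutes[i] == key:
        return i
    raise KeyError(key)


def backfill_loc(sorted_minutes, key):
    """Return the position of key, or of the next later entry if key is absent."""
    i = bisect_left(sorted_minutes, key)
    if i == len(sorted_minutes):
        raise KeyError(key)  # nothing at or after key
    return i


# Hypothetical calendar minutes with no entry for 16:00.
calendar_minutes = [
    datetime(2010, 8, 23, 15, 58),
    datetime(2010, 8, 23, 15, 59),
    datetime(2010, 8, 23, 22, 1),
]
missing = datetime(2010, 8, 23, 16, 0)

try:
    exact_loc(calendar_minutes, missing)
    exact_found = True
except KeyError:
    exact_found = False

backfilled = backfill_loc(calendar_minutes, missing)  # position of 22:01
```

The backfilled position silently points at a later minute than the one requested, so the write proceeds but the bar is no longer aligned with the requested timestamp. Also note that newer pandas releases no longer accept a fill method in `get_loc`; `get_indexer` with `method='backfill'` is the closest equivalent there.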

x777 commented 4 years ago

Similar issue:

Traceback (most recent call last):
  File "pandas/_libs/tslib.pyx", line 1702, in pandas._libs.tslib.convert_str_to_tsobject
  File "pandas/_libs/src/datetime.pxd", line 119, in datetime._string_to_dts
ValueError: Error parsing datetime string "ASTC.csv" at position 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pandas/_libs/tslib.pyx", line 1732, in pandas._libs.tslib.convert_str_to_tsobject
  File "pandas/_libs/tslibs/parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: ASTC.csv

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/x777/anaconda3/envs/env_zipline/bin/zipline", line 11, in <module>
    load_entry_point('zipline==1.3.0', 'console_scripts', 'zipline')()
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/zipline/__main__.py", line 404, in bundles
    map(text_type, bundles_module.ingestions_for_bundle(bundle))
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/zipline/data/bundles/core.py", line 131, in ingestions_for_bundle
    reverse=True,
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/zipline/data/bundles/core.py", line 130, in <genexpr>
    if not pth.hidden(ing)),
  File "/home/x777/anaconda3/envs/env_zipline/lib/python3.5/site-packages/zipline/data/bundles/core.py", line 123, in from_bundle_ingest_dirname
    return pd.Timestamp(cs.replace(';', ':'))
  File "pandas/_libs/tslib.pyx", line 390, in pandas._libs.tslib.Timestamp.__new__
  File "pandas/_libs/tslib.pyx", line 1549, in pandas._libs.tslib.convert_to_tsobject
  File "pandas/_libs/tslib.pyx", line 1735, in pandas._libs.tslib.convert_str_to_tsobject
ValueError: could not convert string to Timestamp

File format (example part):

| date | open | high | low | close | volume |
| --- | --- | --- | --- | --- | --- |
| 2020-03-25 08:00 | 1.2 | 1.88 | 1.2 | 1.88 | 1229 |
| 2020-03-25 08:01 | 2.25 | 2.25 | 2.25 | 2.25 | 2 |
| 2020-03-25 08:03 | 2.25 | 2.25 | 2.25 | 2.25 | 198 |
| 2020-03-25 08:04 | 2 | 2.3899 | 2 | 2.32 | 964 |
| 2020-03-25 08:05 | 2.25 | 2.61 | 2.25 | 2.5 | 3997 |
| 2020-03-25 08:06 | 2.4499 | 2.48 | 2.44 | 2.48 | 1100 |
| 2020-03-25 08:07 | 2.48 | 2.4899 | 2.44 | 2.44 | 1727 |
| 2020-03-25 08:09 | 2.3 | 2.4012 | 2.3 | 2.39 | 1520 |
| 2020-03-25 08:10 | 2.38 | 2.38 | 2.1 | 2.1 | 1121 |
| 2020-03-25 08:11 | 2.05 | 2.1 | 1.78 | 1.78 | 2217 |
| 2020-03-25 08:12 | 1.78 | 1.88 | 1.7 | 1.88 | 1408 |
| 2020-03-25 08:13 | 1.88 | 2.03 | 1.88 | 2 | 2657 |
| 2020-03-25 08:14 | 2.03 | 2.34 | 2.03 | 2.34 | 8467 |
| 2020-03-25 08:15 | 2.34 | 2.47 | 2.27 | 2.47 | 12406 |
| 2020-03-25 08:16 | 2.4 | 2.55 | 2.21 | 2.27 | 8549 |
| 2020-03-25 08:17 | 2.3 | 2.7 | 2.27 | 2.7 | 22131 |
| 2020-03-25 08:18 | 2.75 | 2.9 | 2.73 | 2.76 | 26921 |
| 2020-03-25 08:19 | 2.76 | 3.1 | 2.65 | 3.01 | 17288 |
| 2020-03-25 08:20 | 3.09 | 3.19 | 2.86 | 3.19 | 31333 |
| 2020-03-25 08:21 | 3.15 | 3.3 | 3.02 | 3.11 | 39337 |
| 2020-03-25 08:22 | 3.06 | 3.09 | 2.87 | 2.89 | 40277 |
| 2020-03-25 08:23 | 2.895 | 3.02 | 2.79 | 2.9 | 15370 |
| 2020-03-25 08:24 | 2.9 | 3.27 | 2.9 | 3.14 | 22064 |
| 2020-03-25 08:25 | 3.16 | 3.16 | 2.91 | 3 | 16245 |
| 2020-03-25 08:26 | 2.9999 | 3.08 | 2.9999 | 3 | 8341 |

lobobruno commented 4 years ago

> I'm not confident this is the best fix but it seemed like it had something to do with the calculation of the last possible index in the range. I fixed it in my installation by changing the following line in:
>
> zipline/data/minute_bars.py from:
>
> latest_min_count = all_minutes.get_loc(last_minute_to_write)
>
> to
>
> latest_min_count = all_minutes.get_loc(last_minute_to_write, 'backfill')
>
> to cause it to find the value at the next possible minute after the minute it's looking for, if the minute it's looking for is not found.

@netshade, just be careful: some strategies might be affected. If you place a BUY at market_open(), you might get the price from the previous day rather than the open price of the current day!

netshade commented 4 years ago

Great call, thank you. :)

tstevens02127 commented 4 years ago

Hi @lobobruno, how are you? Do you have a working example of an ingest function for minute-level data that you'd be willing to share? I've been trying to run minute-level backtests and ran into some issues. I've got it working now, but my output has a strange quality. Even though I have minute-level data:

2020-05-08 09:44:00+00:00
2020-05-08 09:45:00+00:00
2020-05-08 09:46:00+00:00

my output zeros out everything but the day, discarding the hour and minute detail. So, for a given trading day, I get a series of 400+ result lines that all share the same timestamp (that day's date). Is this an issue that you encountered? What part of this process could lead to this? Many thanks for your insight.

Output:

2020-05-08 00:00:00+00:00
2020-05-08 00:00:00+00:00
2020-05-08 00:00:00+00:00

hojatm-huma commented 4 years ago

Hi guys!

I solved my problem by setting minutes_per_day to its correct value in the register function when registering my bundle for ingestion. So, to fix the problem, you should register your bundle with both the correct TradingCalendar AND the matching minutes_per_day value.

For example, if you want a 24/7 calendar, your register call should look like this:


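# Imports were not shown in the original comment. The two calendar imports are
# assumptions: they take AlwaysOpenCalendar to be the 24/7 calendar shipped
# with the trading_calendars package.
import pandas as pd
from trading_calendars import register_calendar
from trading_calendars.always_open import AlwaysOpenCalendar
from zipline.data.bundles import register
from zipline.data.bundles.csvdir import csvdir_equities

# Placeholder dates (taken from the original report); use the range of your data.
start_session = pd.Timestamp('2009-08-24', tz='UTC')
end_session = pd.Timestamp('2010-08-24', tz='UTC')

register_calendar(
    'always_open',
    AlwaysOpenCalendar(),
)

register(
    'test_bundle',
    csvdir_equities(
        ['minute'],
        'path_to_your_csv_file',
    ),
    calendar_name='always_open',
    minutes_per_day=1440,  # 24 hours * 60 minutes
    start_session=start_session,
    end_session=end_session
)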

Check these files to see what is happening:

* zipline/data/bundles/core.py line 408
* zipline/data/minute_bars.py line 468
* zipline/data/minute_bars.py line 810

cemal95 commented 4 years ago

@h4ppysmile, would you know how to solve #2700 ?

hbtholen commented 5 months ago

I have the same Timestamp error when I use the always_open calendar. In my case, however, the error occurs on a daily timeframe. I tried several variations of the start date, such as:

start_date = pd.Timestamp('2022-07-29',).tz_localize('UTC')
start_date = pd.Timestamp('2022-07-29')

The failing timestamp appears to be about 20 years before the day I ingested the data bundle via zipline.

This is the Error:

KeyError                                  Traceback (most recent call last)
File pandas\_libs\index.pyx:444, in pandas._libs.index.DatetimeEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:1625, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:1632, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1061596800000000000

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
File c:\Users\henry\miniconda3\envs\ml4t\lib\site-packages\pandas\core\indexes\base.py:3081, in Index.get_loc(self, key, method, tolerance)
   3080 try:
-> 3081     return self._engine.get_loc(casted_key)
   3082 except KeyError as err:

File pandas\_libs\index.pyx:413, in pandas._libs.index.DatetimeEngine.get_loc()

File pandas\_libs\index.pyx:446, in pandas._libs.index.DatetimeEngine.get_loc()

KeyError: Timestamp('2003-08-23 00:00:00+0000', tz='UTC')

The above exception was the direct cause of the following exception:
...
    686     return Index.get_loc(self, key, method, tolerance)
    687 except KeyError as err:
--> 688     raise KeyError(orig_key) from err

KeyError: Timestamp('2003-08-23 00:00:00+0000', tz='UTC')

RichardDale commented 5 months ago

This is probably due to a long-standing hard-coded limit in exchange_calendars.

This manual patch might work around the issue (well, at least back to 1970): https://pypi.org/project/zipline-norgatedata/#patch-to-allow-backtesting-before-20-years-ago

Zipline itself, in calendar_utils, is also hardcoded to 1990. See this patch too: https://pypi.org/project/zipline-norgatedata/#additional-patch-to-allow-backtesting-before-1990
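To check before ingesting whether a bundle's first session would fall outside such a limit, here is a minimal sketch. It assumes the limit is a rolling 20-year window counted back from the ingest date, as the report above suggests; the actual default in your installed calendar package may differ. The function names and example dates are hypothetical.

```python
from datetime import date


def earliest_default_session(ingest_day, years=20):
    """Return the date `years` before ingest_day (assumed rolling window)."""
    try:
        return ingest_day.replace(year=ingest_day.year - years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year; fall back to Feb 28.
        return ingest_day.replace(year=ingest_day.year - years, day=28)


def outside_window(first_session, ingest_day, years=20):
    """True if the bundle's first session predates the assumed calendar start."""
    return first_session < earliest_default_session(ingest_day, years)


# Hypothetical example: data starting 2003-08-23, ingested on 2023-08-24.
too_early = outside_window(date(2003, 8, 23), date(2023, 8, 24))
print(too_early)  # True
```

If the check returns True, the first sessions of the data would not exist in a calendar built with that default start, which matches the kind of KeyError shown above.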