quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

Docs for BcolzDailyBarWriter should indicate that data can't have gaps #2195

Open veeenu opened 6 years ago

veeenu commented 6 years ago

Dear Zipline Maintainers,


Description of Issue

Here is how you can reproduce this issue on your machine:

Reproduction Steps

This is the ingest function I built:

  # "API" and "assets" are defined globally
  def ingest(environ, asset_db_writer, minute_bar_writer, daily_bar_writer,
             adjustment_writer, calendar, start_session, end_session, cache,
             show_progress, output_dir):

    symbols = sorted([a['ticker'] for a in assets])
    dtype = [('start_date', 'datetime64[ns]'),
             ('end_date', 'datetime64[ns]'),
             ('auto_close_date', 'datetime64[ns]'),
             ('symbol', 'object')]
    metadata = pd.DataFrame(np.empty(len(symbols), dtype=dtype))

    def write_fn():
      for idx, asset in enumerate(assets):
        aid, tkr = asset['id'], asset['ticker']
        print(tkr)
        # Replace the following line with, say, a simple pd.read_csv() of a
        # timeseries with legit session gaps, as API() connects to my service
        # and can't be used to reproduce.
        ts = requests.get(API('asset/{:s}/prices/eod'.format(aid))).json()
        df = pd.DataFrame(ts).set_index('time')[['open', 'high', 'low', 'close', 'volume']]

        start_date = pd.to_datetime(df.index[0])
        end_date = pd.to_datetime(df.index[-1])
        metadata.iloc[idx] = start_date, end_date, end_date + pd.Timedelta(days=1), tkr
        yield idx, df

    daily_bar_writer.write(write_fn(), show_progress=True)
    asset_db_writer.write(equities=metadata)

and this is the register() call:

register(
  'mybundle',
  ingest,
  calendar_name='NYSE'
)

What steps have you taken to resolve this already?

I tried looking into Zipline's source code and through the issues/pull requests to find out whether I made a mistake in my implementation but couldn't find anything. Thanks for your help, let me know if you need further information.

Sincerely, Andrea Venuta

yankees714 commented 6 years ago

Hi @veeenu - apologies for the confusion. I looked into this, and the description in #1778 is misleading. Looks like we may have started with a different intention, but in the change we actually merged, the daily bar writer expects no gaps in the data (i.e. they won't be filled).

If you expect gaps, you probably just want to reindex against the expected trading sessions to fill the gaps with NaNs. You should be able to do something like:

from zipline.utils.calendars import get_calendar

# Ensure the df is indexed by UTC timestamps
df = df.set_index(df.index.to_datetime().tz_localize('UTC'))

# Get all expected trading sessions in this range and reindex.
sessions = get_calendar('NYSE').sessions_in_range(start_date, end_date)
df = df.reindex(sessions)
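The effect of that reindexing step can be shown with plain pandas, without a zipline installation. This is a minimal sketch, assuming toy data and using a pandas business-day range as a stand-in for the real `sessions_in_range()` result:

```python
import pandas as pd

# Toy daily data with a gap: the 2018-01-03 session is missing.
df = pd.DataFrame(
    {'close': [10.0, 11.0, 12.0]},
    index=pd.to_datetime(['2018-01-02', '2018-01-04',
                          '2018-01-05']).tz_localize('UTC'),
)

# Stand-in for calendar.sessions_in_range(start_date, end_date):
# all business days in the range, as UTC timestamps.
sessions = pd.date_range('2018-01-02', '2018-01-05', freq='B', tz='UTC')

# Reindexing inserts the missing session as a NaN row, so the
# daily bar writer sees one row per expected session, no gaps.
df = df.reindex(sessions)
print(len(df), int(df['close'].isna().sum()))  # prints: 4 1
```

With a real bundle you would use the calendar's own sessions rather than `freq='B'`, since business days ignore exchange holidays.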
veeenu commented 6 years ago

No problem! I will try reindexing the dataframe. I suggest adding this bit of information to the documentation as I believe time series with gaps are a frequent use case, at least in the context of custom bundles.

Thanks for your patience, and keep up the great work! :)

veeenu commented 6 years ago

After fixing the above, I ran into another issue that I can't solve. The data now seems to be ingested correctly, but I get an error when executing the algorithm. This is my new ingest function:

  def ingest(environ, asset_db_writer, minute_bar_writer, daily_bar_writer,
             adjustment_writer, calendar, start_session, end_session, cache,
             show_progress, output_dir):

    differences = dict()
    symbols = sorted([a['ticker'] for a in assets])
    dtype = [('start_date', 'datetime64[ns]'),
             ('end_date', 'datetime64[ns]'),
             ('auto_close_date', 'datetime64[ns]'),
             ('symbol', 'object')]
    metadata = pd.DataFrame(np.empty(len(symbols), dtype=dtype))

    def write_fn():
      for idx, asset in enumerate(assets):
        aid, tkr = asset['id'], asset['ticker']
        ts = requests.get(API('asset/{:s}/prices/eod'.format(aid))).json()
        df = pd.DataFrame(ts).set_index('time')[['open', 'high', 'low', 'close', 'volume']]
        df.index = pd.to_datetime(df.index).tz_localize('UTC')

        start_date = df.index[0]
        end_date = df.index[-1]

        metadata.iloc[idx] = start_date.tz_convert(None), end_date.tz_convert(None), (end_date + pd.Timedelta(days=1)).tz_convert(None), tkr

        sess = calendar.sessions_in_range(start_date, end_date)
        dif = sess.difference(df.index)

        if len(dif) > 0:
          differences[tkr] = dif

        df = df.reindex(sess)

        yield idx, df

    daily_bar_writer.write(write_fn(), show_progress=True)
    asset_db_writer.write(equities=metadata)
    adjustment_writer.write(
      dividends=pd.DataFrame(columns=['sid', 'amount', 'ex_date', 'record_date', 'declared_date', 'pay_date']),
      splits=pd.DataFrame(columns=['sid', 'ratio', 'effective_date']))
    metadata['exchange'] = 'REINDEER'
    for k, v in differences.items():
      print(k, ' -> ', v) # list gaps

I then wrote a dummy algorithm, which works as intended with Quandl data:

from zipline.api import order, record, symbol

def initialize(context):
    print(context)

def handle_data(context, data):
    print(data)

But, as soon as I switch to my bundle, I get:

$ zipline run -b spx-reindeer -f algo.py -s 2018-01-01 -e 2018-02-01
[2018-05-30 08:51:18.524235] WARNING: Loader: Refusing to download new benchmark data because a download succeeded at 2018-05-30 07:59:42.414161+00:00.
[2018-05-30 08:51:18.549750] WARNING: Loader: Refusing to download new treasury data because a download succeeded at 2018-05-30 07:59:47.815400+00:00.
Traceback (most recent call last):
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\Scripts\zipline-script.py", line 11, in <module>
    load_entry_point('zipline==1.2.0', 'console_scripts', 'zipline')()
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\click\core.py", line 697, in main
    rv = self.invoke(ctx)
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\click\core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\click\core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\click\core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\zipline\__main__.py", line 98, in _
    return f(*args, **kwargs)
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\zipline\__main__.py", line 259, in run
    environ=os.environ,
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\zipline\utils\run_algo.py", line 208, in _run
    overwrite_sim_params=False,
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\zipline\algorithm.py", line 642, in run
    self.trading_environment.asset_finder.sids
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\zipline\assets\assets.py", line 494, in retrieve_all
    update_hits(self.retrieve_equities(type_to_assets.pop('equity', ())))
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\zipline\assets\assets.py", line 528, in retrieve_equities
    return self._retrieve_assets(sids, self.equities, Equity)
  File "C:\Users\avenuta\AppData\Local\Continuum\Anaconda3\envs\zipline\lib\site-packages\zipline\assets\assets.py", line 681, in _retrieve_assets
    asset = asset_type(**filter_kwargs(row))
  File "zipline\assets\_assets.pyx", line 59, in zipline.assets._assets.Asset.__init__ (zipline/assets\_assets.c:1857)
TypeError: __init__() takes at least 2 positional arguments (1 given)

I'm not sure how to debug this, as the failure is deep in the code and seems related to how the bundle is constructed. I also tried adding empty adjustment dataframes, but at this point I can find no significant difference between the calls in my ingest function and those in the csvdir bundle (which I used as a guideline). Do you have any suggestions?

Thanks!

yankees714 commented 6 years ago

It looks like you're setting metadata['exchange'] after passing metadata into asset_db_writer.write(), so my guess is that the data being written is missing an exchange column. I'd try setting that beforehand.
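To illustrate the reordering, here is a minimal sketch using the same names as the ingest function above (mock data; the actual writer call is left commented out because it needs a real bundle environment):

```python
import numpy as np
import pandas as pd

# Same metadata frame shape as in the ingest function above.
dtype = [('start_date', 'datetime64[ns]'),
         ('end_date', 'datetime64[ns]'),
         ('auto_close_date', 'datetime64[ns]'),
         ('symbol', 'object')]
metadata = pd.DataFrame(np.empty(2, dtype=dtype))

# ... metadata rows get populated inside write_fn() as before ...

# Set the exchange BEFORE handing metadata to the asset db writer;
# otherwise the written asset table lacks the required 'exchange'
# column and Asset.__init__ fails at algorithm load time.
metadata['exchange'] = 'REINDEER'

print('exchange' in metadata.columns)  # prints: True
# asset_db_writer.write(equities=metadata)  # now sees the column
```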

For the initial report, it sounds like the only issue is some missing detail in the documentation, so I'm going to update the title of this issue to reflect that. Feel free to open another issue if you run into anything else!

cemal95 commented 4 years ago

How would you solve for extra sessions? I have a similar problem, but with two errors: one reports a missing date, and the other reports extra sessions. How can I ingest the data while ignoring these extra sessions, so that it takes all the available data?
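For what it's worth, the `reindex` approach discussed above handles both symptoms at once: sessions missing from the data come back as NaN rows, and rows whose dates are not in the calendar (extra sessions) are dropped. A pandas-only sketch with toy data, using business days as a stand-in for the real trading calendar:

```python
import pandas as pd

# Toy data: missing 2018-01-04 (an expected session) and containing
# 2018-01-06, a Saturday the calendar does not recognize.
df = pd.DataFrame(
    {'close': [10.0, 11.0, 12.0, 13.0]},
    index=pd.to_datetime(['2018-01-02', '2018-01-03',
                          '2018-01-05', '2018-01-06']).tz_localize('UTC'),
)

# Stand-in for the calendar's expected sessions in this range.
sessions = pd.date_range('2018-01-02', '2018-01-05', freq='B', tz='UTC')

# One call fixes both problems: the missing session becomes a NaN
# row, and the Saturday row is silently dropped.
df = df.reindex(sessions)

print(len(df), int(df['close'].isna().sum()))  # prints: 4 1
```

With a real bundle, `sessions` would come from `calendar.sessions_in_range(df.index[0], df.index[-1])` instead of `pd.date_range`.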