Some algos produce an IndexError with history

twiecki commented 9 years ago

There are a couple of emails on the zipline mailing list that are reporting problems with history. I am able to reproduce on master and get this:

Traceback (most recent call last):
  File "algo.py", line 143, in <module>
    perf_manual = algo_obj.run(data)
  File "/home/wiecki/working/projects/quant/zipline/zipline/algorithm.py", line 476, in run
    for perf in self.gen:
  File "/home/wiecki/working/projects/quant/zipline/zipline/gens/tradesimulation.py", line 129, in transform
    self.algo.instant_fill,
  File "/home/wiecki/working/projects/quant/zipline/zipline/gens/tradesimulation.py", line 234, in _process_snapshot
    new_orders = self._call_handle_data()
  File "/home/wiecki/working/projects/quant/zipline/zipline/gens/tradesimulation.py", line 258, in _call_handle_data
    self.simulation_dt,
  File "/home/wiecki/working/projects/quant/zipline/zipline/utils/events.py", line 194, in handle_data
    event.handle_data(context, data, dt)
  File "/home/wiecki/working/projects/quant/zipline/zipline/utils/events.py", line 212, in handle_data
    self.callback(context, data)
  File "/home/wiecki/working/projects/quant/zipline/zipline/algorithm.py", line 271, in handle_data
    self._handle_data(self, data)
  File "algo.py", line 105, in handle_data
    price_history = history(bar_count=30, frequency='1d', field='price')
  File "/home/wiecki/working/projects/quant/zipline/zipline/utils/api_support.py", line 51, in wrapped
    return getattr(get_algo_instance(), f.__name__)(*args, **kwargs)
  File "/home/wiecki/working/projects/quant/zipline/zipline/algorithm.py", line 996, in history
    return self.history_container.get_history(history_spec, self.datetime)
  File "/home/wiecki/working/projects/quant/zipline/zipline/history/history_container.py", line 837, in get_history
    self.last_known_prior_values,
  File "/home/wiecki/working/projects/quant/zipline/zipline/history/history_container.py", line 46, in ffill_buffer_from_prior_values
    nan_sids = buffer_frame.iloc[0].isnull()
  File "/home/wiecki/envs/zipline_p14/local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1194, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/home/wiecki/envs/zipline_p14/local/lib/python2.7/site-packages/pandas/core/indexing.py", line 1464, in _getitem_axis
    return self._get_loc(key, axis=axis)
  File "/home/wiecki/envs/zipline_p14/local/lib/python2.7/site-packages/pandas/core/indexing.py", line 91, in _get_loc
    return self.obj._ixs(key, axis=axis)
  File "/home/wiecki/envs/zipline_p14/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1667, in _ixs
    label = self.index[i]
  File "/home/wiecki/envs/zipline_p14/local/lib/python2.7/site-packages/pandas/tseries/index.py", line 1264, in __getitem__
    val = getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0
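
The last frames of the traceback reduce to positional indexing into a frame that has zero rows. Below is a minimal sketch of that failure mode and of a defensive check. `first_row_or_none` is a hypothetical helper for illustration, not zipline code, and the empty frame only stands in for the buffer that `ffill_buffer_from_prior_values` receives.

```python
import pandas as pd


def first_row_or_none(frame):
    """Return the first row of `frame`, or None when it has no rows.

    Hypothetical helper, not part of zipline.
    """
    if len(frame) == 0:
        return None
    return frame.iloc[0]


# Stand-in for an empty history buffer: columns exist, but there are no rows.
empty = pd.DataFrame(columns=["GBP", "EUR"], index=pd.DatetimeIndex([]), dtype=float)

try:
    empty.iloc[0]  # same access pattern as buffer_frame.iloc[0] in the traceback
    raised = False
except IndexError:
    raised = True
```

A guard like this only hides the symptom; the open question in the thread is why the buffer is empty in the first place.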

ssanderson commented 9 years ago

@twiecki can you post an example of an algo that triggers this?

twiecki commented 9 years ago

https://groups.google.com/forum/#!topic/zipline/1RiEgZEXyI0, the third post.

twiecki commented 9 years ago

Posting a slightly modified script here (still requires the CSV files):

import datetime
import pytz
import numpy as np
import pandas as pd

#import sklearn
#import scikits
import matplotlib.pyplot as plt
import statsmodels.api as sm  
import pandas.io.data
from scipy import stats

import zipline as zp
from zipline import TradingAlgorithm  
from zipline.api import *
from zipline.finance.slippage import FixedSlippage
from zipline.transforms import batch_transform
from zipline.api import order_target, record, symbol, history, add_history

import math
from pytz import timezone
from zipline.utils import tradingcalendar as calendar

df1 = pd.read_csv("GBPUSD1440.csv", names=['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume'], 
                     index_col='Date_Time', parse_dates=[[0, 1]])
df2 = pd.read_csv("EURUSD1440.csv", names=['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume'], 
                     index_col='Date_Time', parse_dates=[[0, 1]])
df1['open'] = df1['Open']
df1['high'] = df1['High']
df1['low'] = df1['Low']
df1['close'] = df1['Close']
df1['volume'] = df1['Volume']
df1['price'] = df1['Close']
df1 = df1.dropna()
df1 = df1.drop('Open', 1)
df1 = df1.drop('High', 1)
df1 = df1.drop('Close', 1)
df1 = df1.drop('Volume', 1)
df1 = df1.drop('Low', 1)

df1['open1'] = df2['Open']
df1['high1'] = df2['High']
df1['low1'] = df2['Low']
df1['close1'] = df2['Close']
df1['volume1'] = df2['Volume']
df1['price1'] = df2['Close']
df1 = df1.dropna()

df2['open'] = df1['open1']
df2['high'] = df1['high1']
df2['low'] = df1['low1']
df2['close'] = df1['close1']
df2['volume'] = df1['volume1']
df2['price'] = df1['price1']
df2 = df2.dropna()
df2 = df2.drop('Open', 1)
df2 = df2.drop('High', 1)
df2 = df2.drop('Close', 1)
df2 = df2.drop('Volume', 1)
df2 = df2.drop('Low', 1)

df1 = df1.drop('open1', 1)
df1 = df1.drop('high1', 1)
df1 = df1.drop('low1', 1)
df1 = df1.drop('close1', 1)
df1 = df1.drop('volume1', 1)
df1 = df1.drop('price1', 1)

df1 = df1.tz_localize('UTC')
#df1 = df1.tz_convert('US/Eastern')
df2 = df2.tz_localize('UTC')
#df2 = df2.tz_convert('US/Eastern')

data = pd.Panel({'GBP' : df1, 'EUR' : df2})
data = data.dropna()
data
#plt.gcf().set_size_inches(16, 12)

#context.gld = symbol('GBP')
#context.iau = symbol('EUR')
#add_history(50, '1d', 'price')

def initialize(context):
    context.sid1 = symbol('GBP')    # GBP/USD series
    context.sid2 = symbol('EUR')    # EUR/USD series
    context.lookbackPeriod = 30
    context.channelWidth = 2.0
    context.bet_size = 200
    add_history(30, '1d', 'price')
    context.i = 0

# Will be called on every trade event for the securities you specify.  
def handle_data(context, data):
    context.i += 1
    if context.i < 40:
        return

    price_history = history(bar_count=30, frequency='1d', field='price')
    ratio = price_history[context.sid1] / price_history[context.sid2]
    ratioSTD = ratio.std()
    ratioMean = ratio.mean()
    upper = ratioMean + context.channelWidth * ratioSTD
    lower = ratioMean - context.channelWidth * ratioSTD
    ratioToday = data[context.sid1].price/data[context.sid2].price

    record(upper=upper, middle=ratioMean, lower=lower, ratio=ratioToday)

    if ratioToday > upper:
        x = price_history[context.sid1]
        y = price_history[context.sid2]
        theta = sm.OLS(y, x).fit().params[context.sid1]
        long_bet = data[context.sid2].price * context.bet_size
        short_bet = -1*theta * data[context.sid1].price * context.bet_size
        # long sid2
        order_target_value(context.sid2, long_bet)
        # short sid1
        order_target_value(context.sid1, short_bet)
    elif ratioToday < lower:
        x = price_history[context.sid2]
        y = price_history[context.sid1]
        theta = sm.OLS(y, x).fit().params[context.sid2]
        long_bet = data[context.sid1].price * context.bet_size
        short_bet = -1*theta * data[context.sid2].price * context.bet_size
        # long sid1
        order_target_value(context.sid1, long_bet)
        # short sid2
        order_target_value(context.sid2, short_bet)
        context.inPosition = True

algo_obj = TradingAlgorithm(initialize=initialize, 
                            handle_data=handle_data)
# Run algorithm
perf_manual = algo_obj.run(data)

#ax1 = plt.subplot(311)
#perf_manual.portfolio_value.plot(ax=ax1)
#ax1.set_ylabel('portfolio value in $')

llllllllll commented 9 years ago

I ran this locally and have only done some light poking but wanted to post my initial thoughts. I think this is an issue where we are making assumptions that there will not be missing days in daily mode, and instead these will be rows full of nans. The data source that is given has a couple of days missing here and there. More investigation needs to go into the interaction between the dataframe source and history to be certain.

shlomoa commented 9 years ago

The algorithm is the classic dual_moving_avg with a twist:

0) Copy these files to ~/.zipline/cache/: https://drive.google.com/file/d/0B-dfTbup1rFdTWM2cHo1YmFqS00/view?usp=sharing

1) Load this data:

start = datetime.datetime(2011, 9, 9, 9, 30, 0, 0, pytz.utc)
end = datetime.datetime(2011, 11, 21, 15, 59, 0, 0, pytz.utc)

2) Replace all occurrences of '1d' with '1m'.

3) Initialize the algorithm with minute frequency:

algo_obj = zipline.algorithm.TradingAlgorithm(data_frequency='minute', initialize=initialize, handle_data=handle_data)

4) Run it:

perf = algo_obj.run(data)

dalejung commented 9 years ago

I tried running zipline yesterday and I ran into the same error when I added a moving average transform to my minute data. Wasn't sure if it was a problem with my setup since it was my first time running it and looking at the code.

http://nbviewer.ipython.org/gist/dalejung/1ab100b08cbb2dfe0877 is relatively self-contained.

shlomoa commented 9 years ago

@dalejung: using transform is being phased out, but I think your test case can help debug this issue. What I did was buffer more history. For example, where the documented moving average algo says:

context.i += 1
if context.i < 300:
    return

I changed the 300 to 400. The downside is that I lose minutes of trading time that way. Also, if the data is intraday you're fine, but if your data spans more than a day you need to reset the context every day. It's sad, I know.

twiecki commented 9 years ago

Transform was recently refactored to use history so it's not being deprecated.

dalejung commented 9 years ago

For the minute stuff, this happens when the datasource has data outside the trading window. The cur_window_starts can only be market minutes, so when get_history tries to add buffer data it grabs an empty frame since earliest_minute is after algo_dt.

Not sure what the expected behavior should be. Would perhaps make sense to put a guard to block trade data events that are out of the environment's market hours.
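
A minimal sketch of such a guard, applied to the input data before it reaches the simulation rather than inside zipline. `drop_out_of_hours` is a hypothetical helper, and the 09:31 to 16:00 New York session is an assumption; weekends, holidays and half days are not handled.

```python
import pandas as pd


def drop_out_of_hours(minute_bars, open_time="09:31", close_time="16:00",
                      market_tz="America/New_York"):
    """Keep only rows whose local wall-clock time lies within market hours.

    Hypothetical helper: the default session times are assumptions, and
    weekends, holidays and half days are not handled here.
    """
    local = minute_bars.tz_convert(market_tz)
    # between_time includes both endpoints by default.
    in_hours = local.between_time(open_time, close_time)
    return in_hours.tz_convert("UTC")


# One day of UTC minute bars that starts before the open and ends after the close.
index = pd.date_range("2011-09-09 13:00", "2011-09-09 21:00", freq="min", tz="UTC")
bars = pd.DataFrame({"price": 1.0}, index=index)
filtered = drop_out_of_hours(bars)
```

Filtering up front keeps the data source and the history buffers in agreement about which minutes exist, at the cost of discarding any pre-market or after-hours bars.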

ssanderson commented 9 years ago

I think this is an issue where we are making assumptions that there will not be missing days in daily mode, and instead these will be rows full of nans.

@llllllllll if this is the case then the issue should be fixed by pre-computing the expected index (which, conveniently, has already been done in TradingEnvironment) and then doing a reindex on the input data, right?
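
The reindex idea can be sketched with plain pandas. Here `expected_days` is a hand-built business-day index standing in for the calendar that TradingEnvironment already computes, and `align_to_calendar` is a hypothetical name, so this shows the shape of the fix rather than the actual patch.

```python
import numpy as np
import pandas as pd


def align_to_calendar(prices, expected_days):
    """Reindex `prices` onto the expected trading days.

    Missing days become all-NaN rows; rows on days that are not in
    `expected_days` are dropped. Hypothetical helper, not zipline code.
    """
    return prices.reindex(expected_days)


# Assumed calendar: Monday to Friday of one week.
expected_days = pd.bdate_range("2014-01-06", "2014-01-10", tz="UTC")

# Input data with Wednesday missing.
prices = pd.DataFrame({"price": [1.0, 2.0, 4.0, 5.0]},
                      index=expected_days.delete(2))

aligned = align_to_calendar(prices, expected_days)
```

After the reindex the gap is an explicit NaN row, which is the form the assumption quoted above expects.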

sebnil commented 9 years ago

Is someone actively working on this issue? I am trading on the Swedish market and run into this issue for most of the stocks that I trade. Are there any good alternatives to using the history function as a workaround?

sebnil commented 9 years ago

I investigated some more and the code below should show the problem:

from zipline.api import order_target, history, add_history
from zipline.utils.factory import load_from_yahoo
from zipline.algorithm import TradingAlgorithm

def initialize(context):
    add_history(10, '1d', 'price')
    context.i = 0

def handle_data(context, data):
    context.i += 1

    # this call raises an IndexError for some securities
    prices = history(10, '1d', 'price')

    order_target(context.security, 1000)

if __name__ == '__main__':

    # run the algorithm on these securities one by one
    securities = [
        'SKF-B.ST', # swedish company that does not work
        'VOLV-A.ST', # swedish company that does not work
        'AZN.ST', # swedish company that does not work
        'AAPL', # US company works
        'TSLA', # US company works
    ]
    for security in securities:
        # get data from yahoo
        data = load_from_yahoo(stocks=[security], indexes={}, start='20140101', end='20140501')

        # create and run algorithm
        algo = TradingAlgorithm(
            initialize=initialize,
            handle_data=handle_data)
        algo.security = security

        try:
            results = algo.run(data)
            print('OK running algorithm on security.: {}'.format(security))
        except IndexError:
            print('Could not run algorithm on security: {}'.format(security))

tobsch commented 8 years ago

Same issue on the German market (CET timezone). Any ideas?

shlomoa commented 8 years ago

My advice is lame, sorry. Either dump zipline or debug it yourself. I took the second path, which turned into a long path that ended up joining the first.

tobsch commented 8 years ago

And what did you end up with?

kenhersey commented 8 years ago

Hey Thomas - I think I have the cause of this issue identified (or at least one time when this is being hit).

Whenever I have a data file which includes data on a day which zipline considers to not be a trading day, this error is flagged when it hits the ffill_buffer_from_prior_values() function.

To reproduce: just generate a CSV file from a Yahoo retrieval and add a line to the data file for MLK day or some other holiday. Make sure the added line falls after the point where the history window has warmed up (otherwise you won't see the issue). That should generate the error.

This also matches the reports from those having trouble with futures data inputs (like me) and foreign data inputs (sebnil & tobsch, noted above).

Knowing the cause, I'm hopeful you will know where to resolve it, and/or can identify a workaround. Perhaps we will need to change the trading calendar to fix this (or have it ignore the calendar).

Best, Ken
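
A quick way to check a data file for this condition before running an algorithm, sketched with plain pandas. `trading_days` is a hand-built index standing in for zipline's calendar, and `rows_on_non_trading_days` is a hypothetical helper.

```python
import pandas as pd


def rows_on_non_trading_days(data, trading_days):
    """Return the dates in `data` that are absent from `trading_days`.

    Hypothetical helper; `trading_days` is whatever calendar the
    simulation is assumed to use.
    """
    return data.index.normalize().difference(trading_days)


# Assumed calendar: two weeks of business days with MLK day (2014-01-20) removed.
all_weekdays = pd.bdate_range("2014-01-13", "2014-01-24", tz="UTC")
trading_days = all_weekdays.drop(pd.Timestamp("2014-01-20", tz="UTC"))

# Data file that still contains a row for the holiday.
data = pd.DataFrame({"price": range(len(all_weekdays))}, index=all_weekdays)

extra = rows_on_non_trading_days(data, trading_days)
# Drop the offending rows so the data matches the calendar.
cleaned = data[data.index.normalize().isin(trading_days)]
```

Dropping the rows is only appropriate when the calendar is right and the data is wrong; for foreign markets or futures the calendar itself is the likelier mismatch.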

tobsch commented 8 years ago

Hi,

that sounds reasonable. Why don't you integrate a library like this one and let the user define the region?

https://github.com/novapost/workalendar

Best,

Tobias

kenhersey commented 8 years ago

Interesting package - I wonder if "working day" == "trading day"... likely. But, then we need to be able to define a superset of the calendars if you combine for example different country data in the same portfolio. (Then the ffills would fix the data.)

Thus, my hope is that Thomas can readily identify a mechanism that lets the user redefine the current trading calendar as the index of our pandas Panel (portfolio), which will be a superset of the indexes of all the items (securities). That would be the proper behavior.
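
The superset idea can be sketched outside zipline: take the union of every security's index, reindex each frame onto it, and forward-fill the gaps. `align_on_union` and the two sample frames are hypothetical; how zipline would accept such an index as its calendar is not shown here.

```python
import pandas as pd


def align_on_union(frames):
    """Reindex every frame onto the union of all indexes, then forward-fill.

    `frames` maps a security name to a DataFrame. Hypothetical helper.
    """
    union = None
    for frame in frames.values():
        union = frame.index if union is None else union.union(frame.index)
    return {name: frame.reindex(union).ffill() for name, frame in frames.items()}


# Two markets that are closed on different days (dates chosen for illustration).
se = pd.DataFrame({"price": [1.0, 2.0, 3.0]},
                  index=pd.to_datetime(["2014-06-18", "2014-06-19", "2014-06-23"]))
us = pd.DataFrame({"price": [10.0, 20.0, 30.0]},
                  index=pd.to_datetime(["2014-06-18", "2014-06-20", "2014-06-23"]))

aligned = align_on_union({"SE": se, "US": us})
```

Forward-filling cannot fill rows that precede a security's first observation, so leading NaNs remain when one series starts later than the others.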

shlomoa commented 8 years ago

A simulator with one identical method, handle_data. Except for that, it is totally different.

sebnil commented 8 years ago

I solved it by just not using the history function at all. Why does zipline have so much complexity for getting historical data when it can be done using pandas anyway? (not a criticism, but a genuine question)

I did something like this:

historical_data = better_history.get_history(context)

And then a new file with a simple function:

import logging


def get_history(context):
    try:
        return context.data[:context.datetime]
    except AttributeError:
        logging.error('context.data is not set. Make sure to include it in context variable.')
        raise
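
For a fixed-length window like the one `history(bar_count, ...)` returns, the same pandas approach can be trimmed with `tail`. This is a sketch assuming the prices are held in a DataFrame with a sorted DatetimeIndex; `trailing_window` is a hypothetical name.

```python
import pandas as pd


def trailing_window(prices, now, bar_count):
    """Return up to `bar_count` rows ending at `now` (inclusive).

    Assumes `prices` has a sorted DatetimeIndex. Hypothetical helper.
    """
    return prices.loc[:now].tail(bar_count)


days = pd.date_range("2014-01-01", periods=10, freq="D", tz="UTC")
prices = pd.DataFrame({"price": range(10)}, index=days)

window = trailing_window(prices, days[4], 3)
short_window = trailing_window(prices, days[1], 3)
```

When fewer than `bar_count` rows are available the result is simply shorter, so callers need their own warm-up check.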

twiecki commented 8 years ago

Not sure what happened with the crazy re-assignments.

twiecki commented 8 years ago

Anyone who could take a look at this? @brianpfink maybe? CC @ehebert @ssanderson @jfkirk

ricpruss commented 8 years ago

One workaround is to not trade on days that the trading calendar thinks are non-trading days, i.e.:

def handle_data(algo, data):
    if not algo.trading_environment.is_trading_day(algo.get_datetime().date()):
        return

    # Rest of your algo....

The other seems to be a custom trading calendar. If someone has an easy way to derive a trading calendar from a panel, I would prefer that, because the solution I gave clearly skips real trading days that are in the history data. This is obviously even more true for people not using US trading days.

quantopian / zipline

Some algos produce an IndexError with history #447