quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

Problem running with data outside of provided benchmark date range #13

Closed michaelwills closed 8 years ago

michaelwills commented 11 years ago

This is using a local copy of zipline instead of the site-packages one.

The data is OHLC and other indicator data exported as CSV from MetaTrader 4. The timestamps are then munged to be the index, similar to fast-data-mining-with-pytables-and-pandas.pdf, and also localized:

data = read_csv(data_file)
data['time'] = None
for i in data.index:
    data['time'][i] = datetime.strptime(data['Date'][i] + " " + data['Time'][i] + ":00", '%m-%d-%Y %H:%M:%S')

data.index = data['time']
del data['time']

data.index = tseries.index.DatetimeIndex(data=data.index).tz_localize('US/Eastern')

and when trying out the algo

class TestAlgo(TradingAlgorithm):
    def handle_data(self,data):
        print data

my_algo = TestAlgo()
results = my_algo.run(data)

this results in

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-b9b16310508f> in <module>()
     24 
     25 my_algo = TestAlgo()
---> 26 results = my_algo.run(data)
     27 print results.portfolio_value

/Users/michael/Downloads/PandasTutorialFiles/zipline/algorithm.pyc in run(self, source, start, end)
    177         # loop through simulated_trading, each iteration returns a
    178         # perf ndict
--> 179         perfs = list(self.gen)
    180 
    181         # convert perf ndict to pandas dataframe

/Users/michael/Downloads/PandasTutorialFiles/zipline/gens/tradesimulation.pyc in simulate(self, stream_in)
    113         # day.  It will also yield a risk report at the end of the
    114         # simulation.
--> 115         for message in performance_messages:
    116             yield message
    117 

/Users/michael/Downloads/PandasTutorialFiles/zipline/gens/tradesimulation.pyc in transform(self, stream_in)
    202             # Group together events with the same dt field. This depends on the
    203             # events already being sorted.
--> 204             for date, snapshot in groupby(stream_in, attrgetter('dt')):
    205                 # Set the simulation date to be the first event we see.
    206                 # This should only occur once, at the start of the test.

/Users/michael/Downloads/PandasTutorialFiles/zipline/finance/performance.pyc in transform(self, stream_in)
    217                 yield event
    218             else:
--> 219                 event.perf_message = self.process_event(event)
    220                 event.portfolio = self.get_portfolio()
    221                 del event['TRANSACTION']

/Users/michael/Downloads/PandasTutorialFiles/zipline/finance/performance.pyc in process_event(self, event)
    249 
    250         if(event.dt >= self.market_close):
--> 251             message = self.handle_market_close()
    252 
    253         if event.TRANSACTION:

/Users/michael/Downloads/PandasTutorialFiles/zipline/finance/performance.pyc in handle_market_close(self)
    278         #update risk metrics for cumulative performance
    279         self.cumulative_risk_metrics.update(
--> 280             self.todays_performance.returns, datetime.timedelta(days=1))
    281 
    282         # increment the day counter before we move markers forward.

/Users/michael/Downloads/PandasTutorialFiles/zipline/finance/risk.pyc in update(self, returns_in_period, dt)
    417         self.algorithm_volatility.append(
    418             self.calculate_volatility(self.algorithm_returns))
--> 419         self.treasury_period_return = self.choose_treasury()
    420         self.excess_returns.append(
    421             self.algorithm_period_returns[-1] - self.treasury_period_return)

/Users/michael/Downloads/PandasTutorialFiles/zipline/finance/risk.pyc in choose_treasury(self)
    344             term=self.treasury_duration
    345         )
--> 346         raise Exception(message)
    347 
    348 

Exception: no rate for end date = 2012-04-17 00:00:00-04:00 and term = 1month. Check         that date doesn't exceed treasury history range.

At this point the basic test does work using the local copy of zipline which was my sanity check.

[edit: IPython notebook's traceback is clearer]

michaelwills commented 11 years ago

The generated timestamps are like

<class 'pandas.tseries.index.DatetimeIndex'>
[2012-04-16 17:30:00, 2012-04-16 17:35:00]
Length: 2, Freq: None, Timezone: US/Eastern

twiecki commented 11 years ago

I think the issue is that for some of the risk metrics (e.g. alpha, beta) we require a benchmark to be present (e.g. S&P500). This is loaded from the msgpack but only has a limited time range (and you are exceeding it).

I suppose there are two ways to fix this, neither of them immediate unfortunately:

  1. Allow you to supply your own benchmark
  2. Allow you to run without a benchmark and skip the risk metrics that require it.
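As a rough illustration of option 2, the sketch below computes the metrics that need only the algorithm's returns and adds beta and a simplified alpha only when a benchmark is supplied. The function `risk_metrics` and its dictionary keys are hypothetical and are not part of zipline's API; the alpha here ignores the risk-free rate.

```python
from statistics import mean, pstdev


def risk_metrics(algo_returns, benchmark_returns=None):
    """Hypothetical helper (not zipline's API): benchmark metrics are optional."""
    a_mean = mean(algo_returns)
    metrics = {"mean_return": a_mean, "volatility": pstdev(algo_returns)}

    # No benchmark supplied: return only the benchmark-free metrics.
    if benchmark_returns is None:
        return metrics

    if len(benchmark_returns) != len(algo_returns):
        raise ValueError("algorithm and benchmark returns must align")

    b_mean = mean(benchmark_returns)
    variance = pstdev(benchmark_returns) ** 2
    if variance == 0:
        raise ValueError("benchmark returns have zero variance")

    # Population covariance between algorithm and benchmark returns.
    covariance = mean(
        [(a - a_mean) * (b - b_mean)
         for a, b in zip(algo_returns, benchmark_returns)]
    )
    beta = covariance / variance
    metrics["beta"] = beta
    # Simplified per-period alpha, assuming a zero risk-free rate.
    metrics["alpha"] = a_mean - beta * b_mean
    return metrics


benchmark = [0.01, -0.02, 0.03]          # made-up benchmark returns
algo = [2 * r for r in benchmark]        # made-up algorithm returns

without_benchmark = risk_metrics(algo)
with_benchmark = risk_metrics(algo, benchmark)
```

With this shape, a run without a benchmark simply yields a smaller report instead of raising an exception.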

P.S. Please keep those problem reports coming, it's very helpful for us!

michaelwills commented 11 years ago

I am using it in a non-standard way for sure. Forex isn't natively supported of course. The idea is to generate all the test data needed separately, i.e. use MT4 to export OHLC data (or just pop in tick data from my broker), and use exported indicator data in handle_data. I'm a bit new to this kind of backtesting so I'd like to understand what the risk metrics supply. The comments in risk.py are quite helpful in this regard. I'd definitely like to know the Sharpe ratio, etc.

So option 2 would be nice for a quick solution though it doesn't sound quick. :) But option 1 is more desirable for the long term.

[edit] Actually when I get some time I'll look to see how it builds the data. Maybe I can hack some data together and drop it in as a replacement for the treasuries msgpack.

twiecki commented 11 years ago

Thinking some more about this, an easy interface would be to just specify a column in your pandas dataframe that holds your indicator. People will probably want to use other benchmark data sets. That way one would just retrieve e.g. S&P500 alongside the data and it would also be the same range.

Pseudocode:

data = load_from_yahoo(stocks=['AAPL']) # loads SP500 automatically
dma = DualMovingAverage()
results = dma.run(data, benchmark='SP500') # will expect SP500 column in dataframe

If that isn't supplied we could try to fall back to the msgpack benchmark we provide now.

Sound sane?

michaelwills commented 11 years ago

That sounds good actually. I haven't inspected the benchmark data yet but it's just close prices? Does it have to be end of day data or would any timeframe matching my data suffice?

michaelwills commented 11 years ago

I see it's just

In [15]: data[-10:]

Out[15]:
(((2012, 10, 22, 0, 0, 0, 0), 0.00041864067373232743),
 ((2012, 10, 23, 0, 0, 0, 0), -0.014388940812141747),
 ((2012, 10, 24, 0, 0, 0, 0), -0.0031488819699972016),
 ((2012, 10, 25, 0, 0, 0, 0), 0.0022912026331096645),
 ((2012, 10, 26, 0, 0, 0, 0), -0.0007289609828941681),
 ((2012, 10, 31, 0, 0, 0, 0), 0.0008292050262582107),
 ((2012, 11, 1, 0, 0, 0, 0), 0.01089788981730624),
 ((2012, 11, 2, 0, 0, 0, 0), -0.009379443677806564),
 ((2012, 11, 5, 0, 0, 0, 0), 0.002291339585012948),
 ((2012, 11, 6, 0, 0, 0, 0), 0.00785318149104618))

I am assuming those are returns for the period, days in this case. If I am working with 5-minute bars, would I need to provide that per bar?
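For reference, if the values are simple per-period returns, they could be derived from closing prices as in the sketch below, whatever the bar size. The prices are made up and `simple_returns` is a hypothetical helper; whether zipline accepts per-bar benchmark returns is not established here.

```python
def simple_returns(closes):
    """Return p[t] / p[t-1] - 1 for each consecutive pair of closes."""
    return [curr / prev - 1 for prev, curr in zip(closes, closes[1:])]


# Made-up closing prices for three consecutive bars (daily or 5-minute).
closes = [100.0, 101.0, 99.99]
returns = simple_returns(closes)
```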

michaelwills commented 11 years ago

I just realized that's the benchmark data, which I'd still need to provide I imagine. The bit that failed was the treasury data, which is also daily data:

(((2012, 11, 5, 0, 0, 0, 0),
  {'10year': 0.0172,
   '1month': 0.0009,
   '1year': 0.0019,
   '20year': 0.0247,
   '2year': 0.0028,
   '30year': 0.0288,
   '3month': 0.0011,
   '3year': 0.0038,
   '5year': 0.007,
   '6month': 0.0015,
   '7year': 0.0113,
   'tid': 5719}),
 ((2012, 11, 6, 0, 0, 0, 0),
  {'10year': 0.0178,
   '1month': 0.0012,
   '1year': 0.0019,
   '20year': 0.0252,
   '2year': 0.003,
   '30year': 0.0292,
   '3month': 0.001,
   '3year': 0.0041,
   '5year': 0.0075,
   '6month': 0.0015,
   '7year': 0.0119,
   'tid': 5720}))

Quantopian supports minute data so I assume zipline does as well. Will these data sets be fine as is with daily data? And since it searched for

2012-04-17 00:00:00-04:00

instead of something like

2012-04-17 00:00:00

could I essentially fill in the data with data from the nearest point to allow it to complete with a full risk report?

And finally, could it work to have treasuries optionally passed in the same way as the benchmark?
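One way to picture "fill in from the nearest point" is a lookup that falls back to the most recent earlier date when the exact day is missing. The sketch below uses the two treasury rows shown above (trimmed to two terms); `rate_on_or_before` is a hypothetical helper, not a zipline function, and it carries the last known rate forward rather than picking the strictly nearest date.

```python
from bisect import bisect_right
from datetime import date


def rate_on_or_before(curves, day, term):
    """Return the rate for `term` from the latest curve dated on or before `day`.

    `curves` is a list of (date, {term: rate}) pairs sorted by date.
    """
    dates = [d for d, _ in curves]
    i = bisect_right(dates, day)
    if i == 0:
        raise LookupError("no treasury data on or before %s" % day)
    return curves[i - 1][1][term]


curves = [
    (date(2012, 11, 5), {"1month": 0.0009, "10year": 0.0172}),
    (date(2012, 11, 6), {"1month": 0.0012, "10year": 0.0178}),
]

exact = rate_on_or_before(curves, date(2012, 11, 5), "1month")
carried_forward = rate_on_or_before(curves, date(2012, 11, 10), "1month")
```

A date earlier than the first row still raises, since there is nothing to carry forward.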

twiecki commented 11 years ago

I think for now we will just provide functionality to update the benchmark and treasury data. Ultimately it would be nicer if those could be user supplied.

Would that help for now?

michaelwills commented 11 years ago

That would and it is most appreciated!

Part of the challenge is to see what choose_treasury is looking for.

Gah I think I see it now. My timestamp is US/Eastern (-4:00) so I need to do the .tz_convert('UTC') in order for it to match. The day is there but the timezone is different so it could never find a match so there was no rate found. With this

data.index = tseries.index.DatetimeIndex(data=data.index).tz_localize('US/Eastern').tz_convert('UTC')

it's actually running and printing the data. I can keep digging now. Thank you for your patience!
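The mismatch can be reproduced with the standard library alone. This is only an illustration of the mechanism, not zipline's code: it assumes the treasury rates are keyed by midnight UTC, uses a fixed -04:00 offset as a stand-in for US/Eastern in April, and the rate value is a placeholder.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offset standing in for US/Eastern (EDT) on this date.
EDT = timezone(timedelta(hours=-4), "EDT")

# Assumption: rates keyed by midnight UTC; 0.0 is a placeholder value.
rates = {datetime(2012, 4, 17, tzinfo=timezone.utc): 0.0}


def midnight(ts):
    """Truncate a timestamp to midnight in its own timezone."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


bar = datetime(2012, 4, 17, 9, 30, tzinfo=EDT)

# Midnight Eastern is 04:00 UTC, so it never equals the midnight-UTC key.
eastern_key = midnight(bar)
# Converting to UTC first yields a key that matches the stored one.
utc_key = midnight(bar.astimezone(timezone.utc))

found_with_eastern = eastern_key in rates
found_with_utc = utc_key in rates
```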

michaelwills commented 11 years ago

Some further notes. I have arbitrary data going in and I can run tests but I still get exceptions which are probably expected given that I am using intraday data:

self.period_start = {Timestamp} 2012-11-06 14:10:00+00:00
self.trading_days[-1] = {datetime} 2012-11-06 00:00:00+00:00
(<type 'exceptions.AssertionError'>, AssertionError('Period start falls after the last known trading day.',), None)

"zipline/finance/trading.py", line 86, in __init__
    "Period start falls after the last known trading day."
AssertionError: Period start falls after the last known trading day.

That being the case, is there a simple way to run without the benchmark and calculated metrics (as in your comment, @twiecki, 2 days ago at https://github.com/quantopian/zipline/issues/13#issuecomment-10199210)? I haven't gone through all the source, but is there a relatively pain-free way I can disable this? Or perhaps, since trading days are calculated based on the benchmark returns, I can fill that data out so it is accounted for.

At the moment I just catch the exception and let it go as far as possible so I am able to test strategies.
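The two values in the assertion above differ only in time of day: the period start is an intraday timestamp, while the last trading day is stored at midnight. The sketch below shows how a timestamp comparison and a date-level comparison disagree for exactly these values. It is an illustration only; whether comparing by date is the right fix inside zipline is an assumption.

```python
from datetime import datetime, timezone

# Values taken from the debugger output above.
period_start = datetime(2012, 11, 6, 14, 10, tzinfo=timezone.utc)
last_trading_day = datetime(2012, 11, 6, tzinfo=timezone.utc)

# Full-timestamp comparison: 14:10 is after 00:00 on the same day.
within_range_by_timestamp = period_start <= last_trading_day

# Day-level comparison: both fall on 2012-11-06.
within_range_by_date = period_start.date() <= last_trading_day.date()
```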

Thanks again for releasing this!

twiecki commented 11 years ago

Yeah, we really need to make this optional. You can look into finance/performance.py where the risk object is updated if you want.

michaelwills commented 11 years ago

I'll have to look into that. Thanks!

tlmaloney commented 11 years ago

I have a similar issue, and I'd like to turn benchmarking off. Is there a reference for the procedure?

benmccann commented 11 years ago

+1 I just filed a similar bug (https://github.com/quantopian/zipline/issues/125) because I didn't notice this one. ^GSPC goes back to 1950, yet we're limited to running backtests from 1990 onward because that's as far back as the treasury data goes.

ehebert commented 11 years ago

I'm actively working, as a high priority, on https://github.com/quantopian/zipline/issues/46 (streaming of benchmarks and treasury data), which is relevant to this issue: benchmarks and treasury data are currently required mainly because they are inputs to certain risk metrics.

While refactoring how benchmarks and treasury data are stored in risk, I'll see if I can get in some options/flags to disable them completely.

A question I have is, should the default date range remain the one that has both benchmark and treasury data contained within? I'm tending towards saying 'yes', so as to provide richer metrics out of the box, with disabling the metrics being something that needs to be done explicitly.

benmccann commented 11 years ago

Makes sense to me. We could make the default date be 1990 since that's when the benchmarks start, but then add a good error message if you go outside and add an option to disable.

ehebert commented 11 years ago

Ben, I agree.

I'm thinking that the steps of setting the default date of 1990 and providing the warning/error would be at home on your defaults branch.

benmccann commented 11 years ago

@ehebert I updated the pull request (https://github.com/quantopian/zipline/pull/121) to use 1990 as the default start date. I'll leave the warning/error for another change. Seems like it'd go well with the option to disable the benchmarks.

benmccann commented 11 years ago

I think that the best fix for this is to use the 10-year treasury as a benchmark. It is more commonly used as a benchmark and there is good data for it back until the 1950s or 60s. See https://github.com/quantopian/zipline/issues/132

llllllllll commented 8 years ago

We ended up using the 10y treasury curve in https://github.com/quantopian/zipline/commit/d177ddd860fb6419767dd14b587e97615de31519