quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

Investigate: Nans in algorithm_returns #151

Closed twiecki closed 4 years ago

twiecki commented 11 years ago

If, in risk.py, we change (lines 586 and 589):

self.algorithm_returns = self.algorithm_returns_cont.valid()
self.benchmark_returns = self.benchmark_returns_cont.valid()

to:

self.algorithm_returns = self.algorithm_returns_cont[:dt]
self.benchmark_returns = self.benchmark_returns_cont[:dt]

to do explicit slicing rather than implicit slicing via valid(), I get the following test error:

======================================================================
ERROR: test_risk_metrics_returns (tests.test_risk_compare_batch_iterative.RiskCompareIterativeToBatch)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/whyking/working/projects/quant/zipline/tests/test_risk_compare_batch_iterative.py", line 104, in test_risk_metrics_returns
    self.all_benchmark_returns[todays_return_obj.date])
  File "/home/whyking/working/projects/quant/zipline/zipline/finance/risk.py", line 624, in update
    self.beta.append(self.calculate_beta()[0])
  File "/home/whyking/working/projects/quant/zipline/zipline/finance/risk.py", line 463, in calculate_beta
    eigen_values = la.eigvals(C)
  File "/home/whyking/envs/zipline/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 767, in eigvals
    _assertFinite(a)
  File "/home/whyking/envs/zipline/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 165, in _assertFinite
    raise LinAlgError, "Array must not contain infs or NaNs"
LinAlgError: Array must not contain infs or NaNs

Dropping into a debugger, it seems that there is a NaN in front of the last dt, which seems very odd:

>>> self.algorithm_returns
2006-01-03 00:00:00+00:00       NaN
2006-01-04 00:00:00+00:00    0.0093
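
To illustrate why the two forms can differ, here is a minimal sketch. It is not zipline's code: it assumes valid() behaves like dropping NaN entries, uses plain Python lists in place of pandas Series, and the helper names drop_nan, slice_upto and covariance are hypothetical.

```python
import math

# (date, return) pairs standing in for a pandas Series; values mirror the output above.
returns = [("2006-01-03", float("nan")), ("2006-01-04", 0.0093)]


def drop_nan(series):
    """Mimic the behaviour assumed for valid(): discard NaN entries."""
    return [(d, r) for d, r in series if not math.isnan(r)]


def slice_upto(series, dt):
    """Mimic explicit [:dt] slicing: keep everything up to and including dt."""
    return [(d, r) for d, r in series if d <= dt]


def covariance(xs, ys):
    """Population covariance; any NaN input makes the result NaN."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


dropped = drop_nan(returns)
sliced = slice_upto(returns, "2006-01-04")

dropped_values = [r for _, r in dropped]
sliced_values = [r for _, r in sliced]

# The sliced series keeps the leading NaN, the dropped one does not.
print(len(dropped), len(sliced))
print(math.isnan(covariance(sliced_values, sliced_values)))
```

Under these assumptions the sliced series still carries the NaN, so any covariance built from it is not finite, which would be consistent with the LinAlgError above.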

twiecki commented 11 years ago

I think this user email is a related case:

There seems to be a glitch when the data source contains NaN's. The simulation runs for the most part, until risk.py raises an exception near the very end of the simulation. For instance, running the following strategy:

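from zipline.algorithm import TradingAlgorithm

class Test(TradingAlgorithm):
    def handle_data(self, data):
        # Order one share of 'myStock' on every bar.
        self.order('myStock', 1)

if __name__ == '__main__':
    # Placeholder from the original email; the loader itself is not shown.
    data = load_some_data_using_custom_method()
    test = Test()
    result = test.run(data)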

I get the following traceback:

File "C:\quantopian\test.py", line 37 in <module>
    results = test.run(data)
File "C:\quantopian\zipline\algorithm.py", line 295 in run
    perfs = list(self.gen)
File "C:\quantopian\zipline\gens\tradesimulation.py", line 157, in transform
    yield self.get_message(date)
File "C:\quantopian\zipline\gens\tradesimulation.py", line 181, in get_message
    self.algo.perf_tracker.handle_market_close()
File "C:\qunatopian\zipline\finance\performance.py", line 389, in handle_market_close
    self.all_benchmark_returns[todays_return_obj.date])
File "C:\quantopian\zipline\finance\risk.py", in line 643, in update
    raise Exception(message)
Exception: Mismatch between benchmark_returns (1095) and algorithm_returns (1094) in range 1997-01-02 00:00:00+00:00 : 2004-12-31 00:00:00+00:00 on 2001-05-04 00:00:00+00:00

Bug replication:
- I looked back at data and found NaN entries beginning from observation 1095 onwards. data contains about 50 stocks traded at different exchanges (hence different holidays and NaN's to pad the gaps).
- Using the same data set, where 'myStock' contains NaN's, but applying orders only to 'anotherStock' containing no NaN's, the entire backtest runs successfully.
- My zipline repo is updated to one of the May 08 2013 commits.

ehebert commented 11 years ago

Found a bug in the test.test_risk_compare_batch_iterative which was never calling update for the leading. Fix is here, https://github.com/quantopian/zipline/commit/16c488e5bcb455c795c3535c170b8ae798558a99

Still, we should investigate what to do with missing returns. For benchmarks, I think we can replace them with 0.0. For algorithms with missing data, I'm trying to reason about whether 0.0 would be appropriate, since a NaN would imply no volume (so no trades could change the portfolio), and there would be no change in the pricing information with a NaN. (That assumes a NaN really means 'no trades happened here'.)
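
As a rough sketch of the 0.0 replacement idea (not zipline's implementation), the following fills absent or NaN returns against a list of clock dates. The function name fill_missing_returns, the dict-based layout, and the sample data are all hypothetical.

```python
import math


def fill_missing_returns(returns_by_date, clock_dates, fill_value=0.0):
    """Return one value per clock date, substituting fill_value where the
    return is absent or NaN."""
    filled = []
    for dt in clock_dates:
        value = returns_by_date.get(dt)
        if value is None or math.isnan(value):
            value = fill_value
        filled.append(value)
    return filled


# Hypothetical data: the benchmark dates act as the clock.
benchmark_dates = ["2006-01-03", "2006-01-04", "2006-01-05"]
algorithm_returns = {"2006-01-03": float("nan"), "2006-01-04": 0.0093}

aligned = fill_missing_returns(algorithm_returns, benchmark_dates)
print(aligned)
```

With this approach both series always have the same length, at the cost of treating every gap as 'nothing happened'.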

However, the above suggestions would be masking the problem in risk, and I think we should investigate what is going on at the performance module level with the email snippet you attached, since now that we use benchmarks as a 'clock' we should be filling the algorithm returns with values throughout.

Also, @twiecki, I forget, besides the unit tests, was it an example algo or another algo you were working on where you first discovered this? (i.e. what stocks and date range were you working with that had the NaNs.)

ehebert commented 11 years ago

@twiecki, to avoid confusion: https://github.com/quantopian/zipline/commit/16c488e5bcb455c795c3535c170b8ae798558a99 does not contain the valid() vs. [:dt] fix yet. But I do believe it gets the tests in shape to be ready for it.

GiliR4t1qbit commented 10 years ago

Was this ever fixed? I seem to be getting the same type of error, most probably for the same reason (having at least one NaN in the data).

ehebert commented 10 years ago

@GiliR4t1qbit, I'm not sure if this was ever fixed, and can look into it later this week.

Could you share the conditions under which you are seeing the error?

GiliR4t1qbit commented 10 years ago

The data I am working with has lots of NaN's in it, due to stocks that had not yet been traded at the beginning of the period. When I realized this, I decided to add a check to handle_data to only include stocks that do not have NaN's for that time period. This got rid of the error. I suspect I was trying to trade a stock whose price was NaN, but I'm not 100% sure. If this is the case, it would be good if the program did not crash. For example, a reasonable behaviour would be for the order to not be fulfilled and a warning message to appear.
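
A check of this kind could look roughly like the sketch below. It is an illustration only: tradable_symbols is a hypothetical helper working on a plain dict of prices, not part of zipline, and it would still need to be wired into handle_data.

```python
import math
import warnings


def tradable_symbols(prices):
    """Return the symbols whose current price is not NaN, warning about the rest."""
    tradable = []
    for symbol, price in prices.items():
        if price is None or math.isnan(price):
            # Skip the order instead of crashing, and tell the user why.
            warnings.warn("Skipping order for %s: price is NaN" % symbol)
            continue
        tradable.append(symbol)
    return tradable


# Hypothetical prices for a single bar.
bar_prices = {"myStock": float("nan"), "anotherStock": 12.5}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    symbols = tradable_symbols(bar_prices)

print(symbols)
```

Only the symbols returned by the helper would then be passed to order calls for that bar.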

ehebert commented 10 years ago

Ah, your error makes sense then, though it may be surprising. One of the assumptions the system currently makes is that 'drop nans' logic is done in the module providing the data source generator, i.e. no data for the equity at a given dt is emitted if the price is nan.

Handling the nan values upstream, probably in tradesimulation, would be more robust, but would have to be considered against performance trade-offs.
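
To make the 'drop nans' assumption concrete, here is a minimal sketch of a filter wrapped around a data source generator. The event layout (plain dicts with dt, sid and price keys) and the name drop_nan_events are assumptions for illustration, not zipline's actual event type or API.

```python
import math


def drop_nan_events(events):
    """Yield only the events whose price is not NaN."""
    for event in events:
        if math.isnan(event["price"]):
            # No event is emitted for this equity at this dt.
            continue
        yield event


# Hypothetical raw events from a custom data source.
raw_events = [
    {"dt": "2006-01-03", "sid": "myStock", "price": float("nan")},
    {"dt": "2006-01-03", "sid": "anotherStock", "price": 12.5},
    {"dt": "2006-01-04", "sid": "myStock", "price": 10.1},
]

clean_events = list(drop_nan_events(raw_events))
print([(e["dt"], e["sid"]) for e in clean_events])
```

Because the filter is itself a generator, it adds one check per event and no extra buffering, which is the kind of cost that would need weighing if the same check were moved into tradesimulation.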

shlomoa commented 9 years ago

The issue is still there. It has to do with Yahoo data, as I am following the zipline tutorial.

jdricklefs commented 4 years ago

Closing due to age. If anyone is still experiencing this, feel free to reopen.