quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

Regarding data.history API performance #2154

Open conanak99 opened 6 years ago

conanak99 commented 6 years ago

Dear Zipline Maintainers,

I have two questions related to the performance of the data.history API; they are at the end of this post.

Let me explain my situation first: I'm implementing a moving-average trading algorithm that compares the moving averages of four assets: AAPL, AMZN, MSFT, GOOG.

from time import time
from zipline.api import order_target, record, symbols

def initialize(context):
    context.i = 0
    context.assets = symbols('AAPL', 'AMZN', 'MSFT', 'GOOG')

def handle_data(context, data):
    # Skip the first 200 days to get full windows
    context.i += 1
    if context.i < 200:
        return

    start = time()
    for asset in context.assets:
        short_mavg = data.history(asset, 'price', bar_count=50, frequency="1d").mean()
        long_mavg = data.history(asset, 'price', bar_count=200, frequency="1d").mean()

        if short_mavg > long_mavg:
            order_target(asset, 100)
        elif short_mavg < long_mavg:
            order_target(asset, 0)

        record(**{
            asset.symbol: data.current(asset, 'price'),
            asset.symbol + "short_mavg": short_mavg,
            asset.symbol + "long_mavg": long_mavg,
        })

    end = time()
    print("handle data time", end - start)

When I run the backtest from 2016 to 2018, handle_data takes 0.02 seconds to finish. However, when I run the backtest from 2010 to 2018, handle_data takes 0.15 seconds, which is 7 times slower and makes my backtesting very slow.

I have confirmed that data.history is the main cause of the slowness: when I replace the data.history calls with a fixed number, handle_data finishes in 0.002 seconds on average.

data.history (backtest 2016-2018): 0.02s
data.history (backtest 2010-2018): 0.15s (7 times slower)

fixed number (backtest 2016-2018): 0.002s
fixed number (backtest 2010-2018): 0.002s (same as 2016-2018)
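(Editor's note: a minimal, self-contained sketch of how such timings can be confirmed with the standard library's cProfile, not part of the original report. Both function names below are hypothetical stand-ins for a handle_data iteration and a data.history lookup.)

```python
import cProfile
import io
import pstats

def slow_lookup():
    # Hypothetical stand-in for one data.history call.
    return sum(i * 0.5 for i in range(200)) / 200

def handle_data_stub():
    # Hypothetical stand-in for one handle_data iteration over 4 assets.
    return [slow_lookup() for _ in range(4)]

profiler = cProfile.Profile()
profiler.enable()
handle_data_stub()
profiler.disable()

# Print the 5 most expensive calls by cumulative time; in a real run,
# data.history would show up here if it dominates.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Profiling a single bar this way separates "each call got slower" from "there are simply more calls".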

So, my two questions about the performance of the data.history API are:

  1. The longer the backtest timeframe, the slower the data.history API becomes. Is this the expected behaviour of this API?
  2. If that is expected, is there anything I can do to improve the speed of the API (such as preloading all the bar data into RAM for the selected assets)?
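(Editor's note: independent of the root cause, data.history also accepts a list of assets, so the per-asset loop above can be collapsed into one batched 200-bar call per bar, with both moving averages derived from that single window. The zipline call is sketched in a comment as an assumption; the runnable part below simulates it with synthetic pandas data.)

```python
import numpy as np
import pandas as pd

# In zipline, one batched call per bar (assumed replacement for the loop):
#   prices = data.history(context.assets, 'price', bar_count=200, frequency='1d')
# which returns a DataFrame with one column per asset. Simulated here:
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    rng.uniform(90, 110, size=(200, 4)),
    columns=['AAPL', 'AMZN', 'MSFT', 'GOOG'],
)

# Both moving averages come from the same 200-bar window: the long
# average uses all 200 rows, the short one only the last 50.
long_mavg = prices.mean()
short_mavg = prices.tail(50).mean()

for asset in prices.columns:
    if short_mavg[asset] > long_mavg[asset]:
        pass  # order_target(asset, 100) in the real algorithm
    elif short_mavg[asset] < long_mavg[asset]:
        pass  # order_target(asset, 0) in the real algorithm
```

This halves the number of history calls per asset (one window instead of two) and batches the remaining call across all four assets.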

Many thanks, Harry

jbredeche commented 4 years ago

Hi @conanak99, thanks for your report! Sorry for the (very) slow reply.

Generally, the cost of each data.history call should not depend on the overall length of the backtest (although a longer backtest does mean more data.history calls).

One thing that comes to mind is that the price field is forward-filled. This means that if, on any given bar (minute or daily, depending on your backtest granularity), we don't have a price for the asset, we have to go searching backwards for the last price we do have. That can be time-consuming, but I'd be surprised if you were missing pricing data for those names.
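(Editor's note: a generic pandas illustration of forward-filling, not zipline's internal implementation. When a bar has no price, the most recent known value is carried forward, so a long run of missing bars implies a longer backward search.)

```python
import numpy as np
import pandas as pd

# A daily price series with missing bars (NaN), e.g. halted days.
raw = pd.Series([100.0, np.nan, np.nan, 103.0, np.nan, 105.0])

# Forward-fill: each NaN is replaced by the most recent known price.
# The longer the run of NaNs, the further back the fill must reach.
filled = raw.ffill()
print(filled.tolist())  # [100.0, 100.0, 100.0, 103.0, 103.0, 105.0]
```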

Can you try using close or open, or some other price field that isn't forward-filled, to see if it makes a difference?