scrtlabs / catalyst

An Algorithmic Trading Library for Crypto-Assets in Python
http://enigma.co
Apache License 2.0
2.48k stars 723 forks

Approaches for speeding up backtesting #313

Open Thomas214 opened 6 years ago

Thomas214 commented 6 years ago

In the Catalyst Developer Forum we started to discuss some approaches to improve the performance of the backtesting algorithm.

I just moved the discussion over here and will come back with an elaboration of my observation soon. Feel free to post your own ideas.

lenak25 commented 6 years ago

Thanks @Thomas214 for opening this!

vrminds commented 6 years ago

Here is a speed improvement I found on Quantopian, with the tradeoff of losing the performance tracker, i.e. you have to do all the performance measurements at the end of the backtest period 'by hand' (a rough sketch of doing that follows the patch below).

An issue I haven't solved yet: I can't set the parameter 'fast_backtest' from the call site, e.g. run_algorithm(..., fast_backtest=True). With the code below you have to manually change the default in self.fast_backtest = kwargs.pop('fast_backtest', False) to True in algorithm.py.

Modified files:

- algorithm.py
- tracker.py
- tradesimulation.py

algorithm.py line 308:

        self.fast_backtest = kwargs.pop('fast_backtest', False)

        self.sim_params = kwargs.pop('sim_params', None)
        if self.sim_params is None:
            self.sim_params = create_simulation_parameters(
                start=kwargs.pop('start', None),
                end=kwargs.pop('end', None),
                trading_calendar=self.trading_calendar,
            )

        self.sim_params.fast_backtest = self.fast_backtest

tracker.py line 84: self.fast_backtest = sim_params.fast_backtest

tracker.py line 461:

        if not self.fast_backtest:
            bms = pd.Series(
                index=self.cumulative_risk_metrics.cont_index,
                data=self.cumulative_risk_metrics.benchmark_returns_cont)
            ars = pd.Series(
                index=self.cumulative_risk_metrics.cont_index,
                data=self.cumulative_risk_metrics.algorithm_returns_cont)
            acl = self.cumulative_risk_metrics.algorithm_cumulative_leverages

            risk_report = risk.RiskReport(
                ars,
                self.sim_params,
                benchmark_returns=bms,
                algorithm_leverages=acl,
                trading_calendar=self.trading_calendar,
                treasury_curves=self.treasury_curves,
            )

            return risk_report.to_dict()
        else:
            return []

tradesimulation.py line 45: self.fast_backtest = sim_params.fast_backtest

tradesimulation.py line 232: elif action == SESSION_END and not self.fast_backtest:

tradesimulation.py line 232: elif action == MINUTE_END and not self.fast_backtest:
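Since fast_backtest skips the minute- and session-end packets and the risk report, the performance measurements mentioned above have to be done by hand at the end of the run. Here is one rough sketch of that, assuming you record the portfolio value yourself on every bar; the variable names and the metrics chosen are illustrative, not part of Catalyst's API or of the patch above.

import numpy as np
import pandas as pd


def initialize(context):
    # equity curve recorded by hand, since the built-in tracker is bypassed
    context.equity_curve = []


def handle_data(context, data):
    # remember (timestamp, portfolio value) for end-of-run metrics
    context.equity_curve.append(
        (data.current_dt, context.portfolio.portfolio_value))
    # ... strategy logic ...


def analyze(context, perf):
    equity = pd.Series(dict(context.equity_curve)).sort_index()
    returns = equity.pct_change().dropna()

    total_return = equity.iloc[-1] / equity.iloc[0] - 1
    max_drawdown = (equity / equity.cummax() - 1).min()
    # crypto trades around the clock: ~525,600 minute bars per year
    sharpe = np.sqrt(525600) * returns.mean() / returns.std()

    print('total return: {:.2%}'.format(total_return))
    print('max drawdown: {:.2%}'.format(max_drawdown))
    print('sharpe (minute bars): {:.2f}'.format(sharpe))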

eric-valente commented 6 years ago

Would love some additional color here. Minute backtests are very, very slow. I do not need all the summary statistics - I would love a way to filter which calculations I want (just pnl, portfolio_value, etc.). We need some way to speed this up; the current pace really makes iteration hard.

Would love the above changes committed to the repo.

Thanks!

Thomas214 commented 6 years ago

Sorry for the delay. Here is how I speed up backtesting.

I figured out that Catalyst takes a bit of time providing the values via data.current:

open = data.current(context.asset, 'open')
high = data.current(context.asset, 'high')
low = data.current(context.asset, 'low')
close = data.current(context.asset, 'close')
volume = data.current(context.asset, 'volume')

If I do this every minute I end up with a computation time of 2.4 min per month. A whole year would then take around 30 min.

My idea was to save all data into my own data structure and read the values from there. So in a first run I export all desired minute data to a JSON file like this:

In the handle_data function I get the current data from catalyst and save it together with the timestamp into my data structure:

currentTimeString = str(data.current_dt)
currentData = {
               'open':   data.current(context.asset, 'open'),
               'high':   data.current(context.asset, 'high'),
               'low':    data.current(context.asset, 'low'),
               'close':  data.current(context.asset, 'close'),
               'volume': data.current(context.asset, 'volume'),
               }
context.allData[currentTimeString] = currentData

In the analyze function I save this data structure to a json file (import json):

with open('data.json', 'w') as outfile:
    json.dump(context.allData, outfile)

I do all of this just once! Now I have a json file with all saved data. (One year of open, high, low, close and volume values should result in around 50 MB of data.)

In all further runs I import this file into my own data structure in the initialize function:

with open('data.json') as infile:
    context.allData = json.load(infile)

Now I'm just reading these values in the handle_data function instead of asking Catalyst for them:

currentTimeString = str(data.current_dt)
open = context.allData[currentTimeString]['open']
high = context.allData[currentTimeString]['high']
low = context.allData[currentTimeString]['low']
close = context.allData[currentTimeString]['close']
volume = context.allData[currentTimeString]['volume']

If I do this every minute I end up with a computation time of only 18 sec per month. A whole year would then take around 3.5 min.

So with this approach getting the minute data runs around 8 times faster.

(You can also extract data as described in the Catalyst Documentation.)
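For reference, here is one way to put the whole record/replay idea together in a single algorithm. This is only a sketch, not Catalyst API: the USE_CACHE switch, the data.json file name, and the btc_usdt asset are illustrative assumptions. The first run records the bars; every later run replays them from the file.

import json
import os

from catalyst.api import symbol

USE_CACHE = os.path.exists('data.json')  # replay if the cache file already exists


def initialize(context):
    context.asset = symbol('btc_usdt')  # example asset; use your own
    context.allData = {}
    if USE_CACHE:
        with open('data.json') as infile:
            context.allData = json.load(infile)


def handle_data(context, data):
    currentTimeString = str(data.current_dt)
    if USE_CACHE:
        # replay pass: read the bar from the in-memory cache
        bar = context.allData[currentTimeString]
    else:
        # recording pass: ask Catalyst once and remember the bar
        bar = {field: data.current(context.asset, field)
               for field in ('open', 'high', 'low', 'close', 'volume')}
        context.allData[currentTimeString] = bar
    # ... strategy logic using bar['open'], bar['close'], bar['volume'], ...


def analyze(context, perf):
    if not USE_CACHE:
        with open('data.json', 'w') as outfile:
            json.dump(context.allData, outfile)

Note that the replay run must cover the same start and end dates as the recording run, otherwise a timestamp will be missing from the cache.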

eric-valente commented 6 years ago

@Thomas214 Thanks for this - it did not seem to speed things up for me. Are you still using context.asset at all in the backtest?

calclavia commented 6 years ago

Using the method proposed by @Thomas214 definitely speeds things up, but as a library I believe Catalyst should handle this in the backend by caching data in an easy-to-access format.

justinfay commented 6 years ago

Thomas214's method made backtesting a little faster for me (roughly 10%), but because my strategy uses a lot of limit orders and checks whether they are filled every tick, I think Catalyst is still loading the majority of the price data the slow way.

It would be nice if all the data (or larger chunks of it) were read from the data dump files into memory during backtesting; this should be easy to do in backtesting since we know the start and end dates.

usgoose commented 5 years ago

@Thomas214 I am receiving this error when running your code changes:

    context.allData[currentTimeString] = currentData

AttributeError: 'ExchangeTradingAlgorithmBacktest' object has no attribute 'allData'

@ Catalyst team - Has there been any progress with this? Running a simple mean reversion strategy for 1 year of data takes about 3 hours for me. Not realistic at all unfortunately.

gpmn commented 5 years ago

An easy way is to return early at the beginning of handle_data:

hittimes = 0

def handle_data(context, data):
    global hittimes
    hittimes = hittimes + 1
    if (hittimes % 3) != 0:
        return
    # ... rest of handle_data ...

This could save some time if you are not too concerned about precision.
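A small variant of the same throttling idea keeps the counter on context instead of a module-level global; the skip factor of 3 here is just an example.

def initialize(context):
    context.bar_count = 0
    context.skip_factor = 3  # only act on every 3rd bar; tune to taste


def handle_data(context, data):
    context.bar_count += 1
    if context.bar_count % context.skip_factor != 0:
        return  # skip this bar entirely
    # ... rest of handle_data ...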

Thomas214 commented 5 years ago

@Thomas214 I am receiving this error when running your code changes:

    context.allData[currentTimeString] = currentData

AttributeError: 'ExchangeTradingAlgorithmBacktest' object has no attribute 'allData'

Sorry, I forgot to mention that you have to initialize context.allData like this in the initialize function: context.allData = {}
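In other words, the recording pass needs an initialize roughly like this (the asset symbol is only an example):

from catalyst.api import symbol


def initialize(context):
    context.asset = symbol('btc_usdt')  # example asset; use your own
    context.allData = {}                # cache that handle_data fills each bar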