quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

Intraday performance report when using 1 minute data? #2040

Open kvak opened 6 years ago

kvak commented 6 years ago

Q0017.txt

Dear Zipline Maintainers,

Before I tell you about my issue, let me describe my environment:

Environment

alembic==0.9.6
bcolz==0.12.1
Bottleneck==1.2.1
certifi==2017.11.5
chardet==3.0.4
click==6.7
contextlib2==0.5.5
cycler==0.10.0
cyordereddict==1.0.0
Cython==0.27.3
decorator==4.1.2
empyrical==0.3.3
idna==2.6
intervaltree==2.1.0
Logbook==1.1.0
lru-dict==1.1.6
Mako==1.0.7
MarkupSafe==1.0
matplotlib==2.1.0
multipledispatch==0.4.9
networkx==2.0
numexpr==2.6.4
numpy==1.13.3
pandas==0.18.1
pandas-datareader==0.5.0
patsy==0.4.1
pyparsing==2.2.0
python-dateutil==2.6.1
python-editor==1.0.3
pytz==2017.3
requests==2.18.4
requests-file==1.4.2
requests-ftp==0.3.1
scipy==1.0.0
six==1.11.0
sortedcontainers==1.5.7
SQLAlchemy==1.1.15
statsmodels==0.8.0
tables==3.4.2
toolz==0.8.2
urllib3==1.22
zipline==1.1.1

Now that you know a little about me, let me tell you about the issue I am having:

Description of Issue

The ingestion step works fine:

zipline ingest -b ingester
entering machina.  tuSymbols= ('Q0017',)
about to return ingest function
entering ingest and creating blank dfMetadata
dfMetadata <class 'pandas.core.frame.DataFrame'>
<bound method NDFrame.describe of   start_date   end_date auto_close_date symbol
0 1970-01-01 1970-01-01      1970-01-01   None>
S= Q0017 IFIL=/merged_data/Q0017.csv
read_csv dfData <class 'pandas.core.frame.DataFrame'> length 7717 2017-06-15 22:00:00
start_date <class 'pandas.tslib.Timestamp'> 2017-06-15 22:00:00 None
end_date <class 'pandas.tslib.Timestamp'> 2017-06-23 20:00:00 None
ac_date <class 'pandas.tslib.Timestamp'> 2017-06-24 20:00:00 None
liData <class 'list'> length 1
Now calling minute_bar_writer
returned from minute_bar_writer
calling asset_db_writer
dfMetadata <class 'pandas.core.frame.DataFrame'>
           start_date            end_date     auto_close_date symbol exchange
0 2017-06-15 22:00:00 2017-06-23 20:00:00 2017-06-24 20:00:00  Q0017      ICE
symbol_map <class 'pandas.core.series.Series'>
returned from asset_db_writer
calling adjustment_writer
returned from adjustment_writer
now leaving ingest function

So I try to run this toy example:


from zipline.api import symbol, record, order_target
from pytz import timezone

def initialize(context):
    context.contract  = symbol("Q0017")
    context.i         = 0

def handle_data(context, data):
    context.i += 1
    if context.i < 30:
        return

    # Compute averages.
    # data.history() returns a pandas Series of minute closes;
    # take its mean for each moving average.
    short_mavg = data.history(context.contract, 'close', 10, '1m').mean()
    long_mavg  = data.history(context.contract, 'close', 30, '1m').mean()

    # Trading logic
    if short_mavg > long_mavg:
        # order_target orders as many shares as needed to
        # achieve the desired number of shares.
        order_target(context.contract, 100)
    elif short_mavg < long_mavg:
        order_target(context.contract, 0)

    # Save values for later inspection
    record(Q0017      = data.current(context.contract, "close"),
           short_mavg = short_mavg,
           long_mavg  = long_mavg)

# Note: this function can be removed if running
# this algorithm on quantopian.com
def analyze(context=None, results=None):
    import matplotlib.pyplot as plt
    import logbook
    logbook.StderrHandler().push_application()
    log = logbook.Logger('Algorithm')

    fig = plt.figure()
    ax1 = fig.add_subplot(211)
    results.portfolio_value.plot(ax=ax1)
    ax1.set_ylabel('Portfolio value (USD)')

    ax2 = fig.add_subplot(212)
    ax2.set_ylabel('Price (USD)')

    # If data has been record()ed, then plot it.
    # Otherwise, log the fact that no data has been recorded.
    if ('Q0017' in results and 'short_mavg' in results and
            'long_mavg' in results):
        results['Q0017'].plot(ax=ax2)
        results[['short_mavg', 'long_mavg']].plot(ax=ax2)

        trans = results.ix[[t != [] for t in results.transactions]]
        buys = trans.ix[[t[0]['amount'] > 0 for t in
                         trans.transactions]]
        sells = trans.ix[
            [t[0]['amount'] < 0 for t in trans.transactions]]
        ax2.plot(buys.index, results.short_mavg.ix[buys.index],
                 '^', markersize=10, color='m')
        ax2.plot(sells.index, results.short_mavg.ix[sells.index],
                 'v', markersize=10, color='k')
        plt.legend(loc=0)
    else:
        msg = 'Q0017, short_mavg & long_mavg data not captured using record().'
        ax2.annotate(msg, xy=(0.1, 0.5))
        log.info(msg)

    plt.show()

And it runs, but the performance report:


zipline run -f my_first_backtest.py --bundle ingester --data-frequency minute -s 2017-06-15 -e 2017-06-23
entering machina.  tuSymbols= ('Q0017',)
about to return ingest function
[2017-12-04 23:31:30.040649] WARNING: Loader: Refusing to download new benchmark data because a download succeeded at 2017-12-04 23:19:05.652054+00:00.
[2017-12-04 23:31:33.389892] INFO: Performance: Simulated 7 trading days out of 7.
[2017-12-04 23:31:33.390008] INFO: Performance: first open: 2017-06-15 13:31:00+00:00
[2017-12-04 23:31:33.390093] INFO: Performance: last close: 2017-06-23 20:00:00+00:00
                           Q0017  algo_volatility  algorithm_period_return  \
2017-06-15 20:00:00+00:00    NaN              NaN                 0.000000   
2017-06-16 20:00:00+00:00  47.29         0.000104                -0.000009   
2017-06-19 20:00:00+00:00  46.81         0.002406                -0.000276   
2017-06-20 20:00:00+00:00  45.84         0.006135                -0.001100   
2017-06-21 20:00:00+00:00  44.78         0.012046                -0.002896   
2017-06-22 20:00:00+00:00  45.29         0.013814                -0.002144   
2017-06-23 20:00:00+00:00  45.61         0.012915                -0.002037

has only 7 rows (one per day). Given 7717 intraday bars, I would expect to see one row per bar (7717 of them).

imkoukou commented 6 years ago

The problem is in algorithm.py. You can find the following main loop over all bars:

        for perf in self.get_generator():
            perfs.append(perf)

        # convert perf dict to pandas dataframe
        daily_stats = self._create_daily_stats(perfs)
        self.analyze(daily_stats)

        return daily_stats

daily_stats holds all the information for the results parameter of analyze(). Inside self._create_daily_stats you can see that results data is only recorded once per day.

[My solution] Add a _create_minute_stats() function to replace the _create_daily_stats() function. You can simply copy the _create_daily_stats function definition and replace every "daily" string with "minute". I tested this and it works. Then, in the "for perf in self.get_generator():" loop, use the minute stats like:

        minute_stats = self._create_minute_stats(perfs)
        self.analyze(minute_stats)

        return minute_stats

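The renaming approach can be sketched self-containedly. The toy perfs list and the standalone helper below are illustrations of the idea only, not zipline's actual code: each per-bar dict's recorded_vars (and risk metrics, if present) are folded into one row, indexed by that bar's period_close.

```python
import pandas as pd

def create_minute_stats(perfs):
    """Collect each bar's 'minute_perf' dict into a DataFrame
    indexed by that bar's period_close, mirroring the daily version."""
    minute_perfs = []
    for perf in perfs:
        if 'minute_perf' in perf:
            # fold recorded_vars (and risk metrics, if present) into the row
            perf['minute_perf'].update(perf['minute_perf'].pop('recorded_vars'))
            perf['minute_perf'].update(perf.get('cumulative_risk_metrics', {}))
            minute_perfs.append(perf['minute_perf'])
    index = pd.DatetimeIndex(
        [p['period_close'] for p in minute_perfs], tz='UTC'
    )
    return pd.DataFrame(minute_perfs, index=index)

# Toy input: two minute bars
perfs = [
    {'minute_perf': {'period_close': '2017-06-16 13:31:00',
                     'portfolio_value': 100000.0,
                     'recorded_vars': {'short_mavg': 47.1}}},
    {'minute_perf': {'period_close': '2017-06-16 13:32:00',
                     'portfolio_value': 100010.0,
                     'recorded_vars': {'short_mavg': 47.2}}},
]
stats = create_minute_stats(perfs)
print(stats.shape)  # one row per minute bar: (2, 3)
```

This gives one DataFrame row per minute bar instead of one per day, which is exactly the shape analyze() receives in the daily case.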
apoorvkalra commented 6 years ago

@imkoukou could you post your edited algorithm.py file here? I tried replacing all the "daily" strings with "minute" and ended up getting no performance data at all.

imkoukou commented 6 years ago

@apoorvkalra Hi. I uploaded the edited algorithm.py I am using and the original one. You can compare them to see what I changed.

The displayed report depends on what is returned by the run() method in algorithm.py. I prefer to return the daily report, because the volume of the minute report is unwieldy. It would be displayed like this: minute report

algorithm.zip

studyquant commented 6 years ago

It doesn't work in my zipline 1.2.0 (Python 3.5), since the perf variable has no 'minute_perf' key.

        try:
            perfs = []
            for perf in self.get_generator():
                perfs.append(perf)

            # convert perf dict to pandas dataframe
            daily_stats = self._create_daily_stats(perfs)
            if self.sim_params.data_frequency == 'daily':
                self.analyze(daily_stats)

            # Revised by studyquant
            if self.sim_params.data_frequency == 'minute':
                minute_stats = self._create_minute_stats(perfs)
                self.analyze(minute_stats)
        finally:
            self.data_portal = None

        for perf in perfs:
            if 'minute_perf' in perf:
                # print("perf is:\n", perf)

                perf['minute_perf'].update(
                    perf['minute_perf'].pop('recorded_vars')
                )
                perf['minute_perf'].update(perf['cumulative_risk_metrics'])

                daytime = perf['period_start'].strftime("%a")
                # Only analyze days that are in the CustomBusinessDay of our calendar.
                if daytime in workDayStrList:
                    minute_perfs.append(perf['minute_perf'])
            else:
                self.risk_report = perf

        minute_dts = pd.DatetimeIndex(
            [p['period_close'] for p in minute_perfs], tz='UTC'
        )

imkoukou commented 6 years ago

@studyquant

  1. Have you configured the minute CSV data correctly, including the calendar settings? In the latest version of Zipline, minute data processing is included, which makes minute-data simulation much easier to configure.
  2. Did you run the script with minute data_frequency, and are the specified start and end times within the CSV data range?
studyquant commented 6 years ago

Dear imkoukou: thank you for your reply. My data and calendar settings are correct, and I have now updated zipline to version 1.3.0.

The panel looks like this:

date                                                                      
2018-02-05 17:48:00  7170.000000  7171.000000  7170.000000  7170.990234   
2018-02-05 17:49:00  7131.990234  7171.000000  7170.990234  7131.990234   
2018-02-05 17:50:00  7120.000000  7137.359863  7132.000000  7120.020020   
2018-02-05 17:51:00  7113.000000  7121.000000  7120.040039  7113.000000   
2018-02-05 17:52:00  7113.000000  7122.000000  7113.000000  7121.990234   

volume  
date                            
2018-02-05 17:48:00   3.425961  
2018-02-05 17:49:00   5.209975  
2018-02-05 17:50:00  14.767619  
2018-02-05 17:51:00  18.237879  
2018-02-05 17:52:00  22.768671  
<class 'pandas.core.panel.Panel'>
Dimensions: 1 (items) x 72277 (major_axis) x 5 (minor_axis)
Items axis: BTC to BTC
Major_axis axis: 2018-02-05 17:48:00+00:00 to 2018-03-27 22:24:00+00:00
Minor_axis axis: low to volume

In the zipline/algorithm.py file, I have added a function to the class:

        # create minute and cumulative stats dataframe
        minute_perfs = []
        workDayStrList = self.trading_calendar.day.weekmask.split(" ")
        # TODO: the loop here could overwrite expected properties
        # of minute_perf. Could potentially raise or log a
        # warning.
        # perfDF = pd.DataFrame(perfs)
        # print("daily stats perfs are:\n",perfDF.head(),"\n...\n",perfDF.tail())

        for perf in perfs:
            if 'minute_perf' in perf:
                # print("perf is:\n",perf)

                perf['minute_perf'].update(
                    perf['minute_perf'].pop('recorded_vars')
                )
                perf['minute_perf'].update(perf['cumulative_risk_metrics'])

                daytime = perf['period_start'].strftime("%a")
                # Only analyze days that are in the CustomBusinessDay of our calendar.
                if (daytime in workDayStrList):
                    minute_perfs.append(perf['minute_perf'])
            else:
                self.risk_report = perf

        minute_dts = pd.DatetimeIndex(
            [p['period_close'] for p in minute_perfs], tz='UTC'
        )

        minute_stats = pd.DataFrame(minute_perfs, index=minute_dts)

        return minute_stats

and revised the run() method:

        try:
            perfs = []
            for perf in self.get_generator():
                perfs.append(perf)

            # convert perf dict to pandas dataframe
            daily_stats = self._create_daily_stats(perfs)
            if self.sim_params.data_frequency == 'daily':
                self.analyze(daily_stats)

            ### Revised by me ###
            # The user script's analyze is executed here:
            # ### user file: analyze(context=None, results=None)
            if self.sim_params.data_frequency == 'minute':
                minute_stats = self._create_minute_stats(perfs)
                self.analyze(minute_stats)
        finally:
            self.data_portal = None

        # return daily_stats would display the daily report after
        # simulation even in minute frequency.
        # return minute_stats displays the minute report after
        # simulation, N/A for daily frequency.
        return minute_stats

It does not display the intraday report, because inside the _create_minute_stats(self, perfs) function the loop's check never passes:

    'minute_perf' in perf
    Out[8]: 
    False

So there is no 'minute_perf' key in the perf variable, while 'daily_perf' is present:

    'daily_perf' in perf
    Out[11]: 
    True

Because there is no 'minute_perf' key in perf, the returned minute_stats is an empty DataFrame:

    perf = zipline.run_algorithm(start=datetime(2018, 3, 8, 0, 0, 0, 0, pytz.utc),
                                 end=datetime(2018, 3, 10, 0, 0, 0, 0, pytz.utc),
                                 initialize=initialize,
                                 trading_calendar=TwentyFourHR(),
                                 capital_base=1000000,
                                 handle_data=handle_data,
                                 data_frequency='minute',
                                 data=panel)
    print(perf)
    Empty DataFrame
    Columns: []
    Index: []

I downloaded the files you uploaded, replaced the algorithm file, and ran it. It is unable to run; some packages cannot be imported since some files are missing from my local environment. For example, `from zipline.finance.performance import PerformanceTracker`: the default zipline 1.3.0 has no performance folder in path\zipline\finance, and no calendars folder in zipline\utils. As a result, I just revised the algorithm file following your description; it is unable to show the intraday performance report.

imkoukou commented 6 years ago

@studyquant The files I posted are for an older version of Zipline. Version 1.3.0 changed the folder structure; even the calendar has been separated into a standalone package outside the Zipline folder. I updated to Zipline 1.3.0 recently and at first encountered the same problem, that 'minute_perf' was not in perf. For me, the problem was solved by:

  1. The calendar is initialized by default with get_calendar("NYSE") in "zipline\utils\run_algo.py", which does not match my minute data well. I customized it and specified the custom calendar for minute-frequency data.
  2. I found that in the initialization of the TradingAlgorithm class in "zipline\utils\run_algo.py", the emission_rate is not correctly specified even if I use --data-frequency minute to tell zipline to run in minute mode. I added the parameter emission_rate=data_frequency.
  3. As in the older version, I added my "_create_minute_stats" function to return minute_stats.

Make sure you have registered a suitable calendar for your minute CSV data, including the dates in the calendar.

That is all I did for Zipline 1.3.0, and it returns the right minute report for me. algorithm for Zipline Version 1.3.0.zip

studyquant commented 6 years ago

@imkoukou Well, it works. However, I guess there are some errors in the algorithm file you provided; it shows this:

  File "C:\Anaconda3.5-64\lib\site-packages\zipline\algorithm.py", line 759, in run
    for perf in self.get_generator():
  File "C:\Anaconda3.5-64\lib\site-packages\zipline\algorithm.py", line 632, in get_generator
    return self._create_generator(self.sim_params)
  File "C:\Anaconda3.5-64\lib\site-packages\zipline\algorithm.py", line 607, in _create_generator
    metrics_tracker.handle_start_of_simulation(benchmark_source)
  File "C:\Anaconda3.5-64\lib\site-packages\zipline\finance\metrics\tracker.py", line 144, in handle_start_of_simulation
    benchmark_source,
  File "C:\Anaconda3.5-64\lib\site-packages\zipline\finance\metrics\tracker.py", line 127, in hook_implementation
    impl(*args, **kwargs)
  File "C:\Anaconda3.5-64\lib\site-packages\zipline\finance\metrics\metric.py", line 190, in start_of_simulation
    daily_returns_series = benchmark_source.daily_returns(
AttributeError: 'NoneType' object has no attribute 'daily_returns'

Anyway, the intraday report is working now. Many thanks for your help. I have uploaded the algorithm I use below. algorithm.zip

imkoukou commented 6 years ago

@studyquant The error may be caused by some other changes I made to the benchmark function; just ignore it. :)

dpkdeepakpandey commented 5 years ago

@studyquant @imkoukou I am new to zipline and currently facing an issue with returning the minute performance. I have a dataset consisting of minute-by-minute data for 1 week, but when running the algorithm (zipline.run_algorithm) I get performance (perf) on a daily basis. I made the changes required in run_algo.py and algorithm.py, but the performance is still reported daily. The data is entering _create_minute_stats, but it returns an empty DataFrame.

balut91 commented 5 years ago

@dpkdeepakpandey I found the solution. In def _create_minute_stats(self, perfs), comment out the check if (daytime in workDayStrList):

workDayStrList comes back with an unexpected value that never matches. Anyway, working days are already accounted for when the perf results are generated; I don't think we need to check the working-day list again. Please reply if this is correct or wrong.
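The mismatch is easy to reproduce outside zipline: if the calendar's weekmask comes back in the numeric form "1111100", splitting it on spaces yields a list that can never contain a strftime("%a") day abbreviation, so every bar is filtered out. A small illustration (not zipline code; assumes the default C locale for strftime):

```python
from datetime import datetime

# A weekmask may be day abbreviations or the numeric form.
numeric_mask = "1111100".split(" ")        # -> ['1111100']
abbrev_mask  = "Mon Tue Wed Thu Fri".split(" ")

bar_time = datetime(2018, 2, 5, 17, 48)    # a Monday
day = bar_time.strftime("%a")              # 'Mon' in the C locale

print(day in numeric_mask)   # False: every bar would be dropped
print(day in abbrev_mask)    # True
```

So either normalize the weekmask into abbreviations before the comparison, or drop the check entirely, as suggested above.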

dpkdeepakpandey commented 5 years ago

Yeah, workDayStrList was coming back as [1111100], which I guess was meant to encode [Mon,Tue,Wed,Thu,Fri,Sat,Sun], but it was not helping. So instead I set my working days to [Mon,Tue,Wed,Thu,Fri] by default; comparing that against daytime works fine for minute-level data.

Currently I am facing issues when I want to do the same with '5min', '10min', '15min'. Every time it expects data minute by minute. Does zipline support these intervals as well? I am not fully aware of it.

balut91 commented 5 years ago

@dpkdeepakpandey I am looking for a solution to the same thing. My strategy depends on different time frames, like you said: '5min', '10min'. I came across something called batch_transform in zipline, but it looks like it is deprecated. Please reply if you find a solution for this.
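A common workaround for coarser intervals, independent of zipline internals, is to gate handle_data on a bar counter so the body only runs every N minute-bars. A minimal sketch of the pattern (make_handle_data and its toy driver below are illustrations, not zipline API):

```python
BAR_INTERVAL = 5  # act every 5 minute-bars to emulate a '5min' interval

def make_handle_data(interval):
    """Return a handle_data that only runs its body every `interval` bars."""
    state = {'i': 0}

    def handle_data(context, data):
        state['i'] += 1
        if state['i'] % interval != 0:
            return  # skip intermediate minute bars
        # ... 5-minute trading logic would go here ...
        return state['i']

    return handle_data

hd = make_handle_data(BAR_INTERVAL)
fired = [hd(None, None) for _ in range(12)]
print([f for f in fired if f is not None])  # -> [5, 10]
```

Inside zipline one would keep the counter on context (as the toy example at the top of this thread does with context.i) rather than in a closure; the gating logic is the same.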

balut91 commented 5 years ago

@dpkdeepakpandey

Check out this link. It may help you https://www.quantopian.com/posts/how-to-get-rsi-in-30-minutes-time-frame https://www.quantopian.com/posts/how-to-chunk-minutely-data-into-5-15-78-minute-bars

imkoukou commented 5 years ago

@dpkdeepakpandey @balut91 My solution for running at any period:

  1. In utils/events.py, class EventManager(object), change the handle_data() method to return a value. That will receive the value returned by the handle() function in your running file. For me, the method is changed like this:

         def handle_data(self, context, data, dt):
             # Revised by me: add return for handle_data
             rslts = []
             with self._create_context(data):
                 for event in self._events:
                     s = event.handle_data(
                         context,
                         data,
                         dt,
                     )
                     rslts.append(s)
             return rslts
  2. In algorithm.py, class TradingAlgorithm: change the handle_data() method to return the value returned by self._handle_data(self, data). After you finish step 1, self._handle_data can return something.
  3. In gens\tradesimulation.py, class AlgorithmSimulator(object), in every_bar() inside transform(): save the returned value, e.g. rrr = handle_data(algo, current_data, dt_to_use). (After you finish step 2, handle_data() can return something.) Then in transform() you can find something like:

         elif action == MINUTE_END:
             minute_msg = self._get_minute_message(
                 dt,
                 algo,
                 metrics_tracker,
             )
             yield minute_msg

     Here, if rrr is None (you can define what counts), just skip the yield with the continue keyword. For me it is:

         if tmp_do:
             minute_msg = self._get_minute_message(
                 dt,
                 algo,
                 metrics_tracker,
             )
             yield minute_msg
         else:
             continue
  4. Finally, you can skip some minutes by returning None at the entry of the handle() function (you could also choose something else) in your running algorithm file.

The period at which to process minute bars is controlled by how often handle() returns a non-None value. For example, returning None 59 times and then a non-None value once simulates 60-minute bar data.
Besides, there is also some work you need to do: compose a new OHLC value for the 60-minute bar from the 60 one-minute OHLC bars.
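Composing a 60-minute OHLCV bar from 60 one-minute bars is a standard resampling step. A minimal pandas sketch with toy data (the low/high/open/close/volume column names are assumed to match the minute bundle shown earlier in the thread):

```python
import numpy as np
import pandas as pd

# Toy minute bars: 120 minutes of synthetic data
idx = pd.date_range("2018-02-05 17:00", periods=120, freq="min", tz="UTC")
minute = pd.DataFrame({
    "open":   np.arange(120, dtype=float),
    "high":   np.arange(120, dtype=float) + 0.5,
    "low":    np.arange(120, dtype=float) - 0.5,
    "close":  np.arange(120, dtype=float) + 0.25,
    "volume": np.ones(120),
}, index=idx)

# First open, max high, min low, last close, summed volume per hour
hourly = minute.resample("60min").agg({
    "open": "first", "high": "max", "low": "min",
    "close": "last", "volume": "sum",
})
print(hourly.shape)  # (2, 5): two hourly bars from 120 minute bars
```

The same aggregation with a "5min" or "15min" rule covers the intervals asked about above.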

dpkdeepakpandey commented 5 years ago

@imkoukou Can you post the files where changes are required?

And what is meant by composing a new OHLC value? Why is it required?

dpkdeepakpandey commented 5 years ago

@balut91 Did you find anything new for intervals of 5min, 10min, or 15min? The links you shared are not helping me.

tstevens02127 commented 4 years ago

Hello, I've been trying to run minute-level backtests with some issues, but I've found your edits to algorithm.py invaluable. I've got it to work now, but my output has a strange quality. Even though I have minute-level data:

2020-05-08 09:44:00+00:00 2020-05-08 09:45:00+00:00 2020-05-08 09:46:00+00:00

My output zeros out everything but the day, tossing the hour and minute detail. So, for a given trading day, I've got a series of 400+ lines of results that all share the same timestamp (that day's date). Is this an issue that you encountered? What part of this process could lead to it? Many thanks for your insight. Output:

2020-05-08 00:00:00+00:00 2020-05-08 00:00:00+00:00 2020-05-08 00:00:00+00:00

tstevens02127 commented 4 years ago

@imkoukou You mentioned that you made changes to the benchmark, some of which I can see in the algorithm file you gave us. Would you mind sharing the other edits you made elsewhere in the zipline files? I've tried to delete the whole thing but it's a mess. Thanks for your insight.

AceFromSpace commented 3 years ago

Hi, I suppose I have a solution for your issue, but for the newest version, 1.4.1.

In the file zipline/finance/trading.py, change the default value of emission_rate to 'minute':

class SimulationParameters(object):
    def __init__(self,
                 start_session,
                 end_session,
                 trading_calendar,
                 capital_base=DEFAULT_CAPITAL_BASE,
                 emission_rate='minute',
                 data_frequency='daily',
                 arena='backtest'):

After that, in zipline/algorithm.py, 'minute_perf' will be available in perfs as in previous versions of zipline. So you just need to create the function _create_minute_stats(self, perfs) and use it instead of _create_daily_stats(self, perfs) to parse the data from the simulation.

def _create_minute_stats(self, perfs):
        # create minute and cumulative stats dataframe
        minute_perfs = []
        # TODO: the loop here could overwrite expected properties
        # of minute_perf. Could potentially raise or log a
        # warning.
        for perf in perfs:
            if 'minute_perf' in perf:

                perf['minute_perf'].update(
                    perf['minute_perf'].pop('recorded_vars')
                )
                perf['minute_perf'].update(perf['cumulative_risk_metrics'])
                minute_perfs.append(perf['minute_perf'])
            else:
                self.risk_report = perf

        minute_dts = pd.DatetimeIndex(
            [p['period_close'] for p in minute_perfs], tz='UTC'
        )
        minute_stats = pd.DataFrame(minute_perfs, index=minute_dts)
        return minute_stats

That works for me, but let me know if there are still any issues with it.
zipline_minute_perf_1_4_1.zip

tstevens02127 commented 3 years ago

Very good Ace! I have not updated to 1.4.1 but I was able to crowbar a solution for 1.3 (my setup). Can share if interested... and thanks for sharing yours.

RiccaDS commented 2 years ago

I confirm @AceFromSpace's version works like a charm for now. I am actually using zipline-reloaded. Thanks.