quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

custom bundle ingest error #2714

Open sabirjana opened 4 years ago

sabirjana commented 4 years ago

Dear Zipline Maintainers,

Before I tell you about my issue, let me describe my environment:

Environment

* Operating System: Windows 10
* Python Version: Python 3.5.6
* Python Bitness: 64
* How did you install Zipline: .yml file
* Python packages:
  alembic==1.4.1
  alphalens==0.4.0
  asn1crypto==0.24.0
  backcall==0.1.0
  bcolz==0.12.1
  bleach==3.1.1
  Bottleneck==1.2.1
  certifi==2018.8.24
  cffi==1.11.5
  chardet==3.0.4
  Click==7.0
  colorama==0.4.3
  contextlib2==0.6.0.post1
  cryptography==2.3.1
  cryptography-vectors==2.3.1
  cycler==0.10.0
  cyordereddict==1.0.0
  Cython==0.28.5
  decorator==4.4.2
  defusedxml==0.6.0
  empyrical==0.5.0
  entrypoints==0.2.3
  idna==2.7
  intervaltree==3.0.2
  ipykernel==5.1.0
  ipython==7.0.1
  ipython-genutils==0.2.0
  ipywidgets==7.4.2
  jedi==0.12.1
  Jinja2==2.11.1
  joblib==0.14.1
  jsonschema==2.6.0
  jupyter-client==6.1.3
  jupyter-console==6.1.0
  jupyter-core==4.6.3
  kiwisolver==1.0.1
  Logbook==0.12.5
  lru-dict==1.1.4
  lxml==3.8.0
  Mako==1.1.2
  MarkupSafe==1.0
  matplotlib==3.0.0
  mistune==0.8.3
  mkl-fft==1.0.9
  mkl-random==1.0.1
  multipledispatch==0.6.0
  nbconvert==5.6.0
  nbformat==5.0.4
  networkx==1.11
  notebook==6.0.3
  numexpr==2.6.6
  numpy==1.14.2
  pandas==0.22.0
  pandas-datareader==0.8.1
  pandocfilters==1.4.2
  parso==0.6.2
  patsy==0.5.1
  pexpect==4.6.0
  pickleshare==0.7.5
  prometheus-client==0.7.1
  prompt-toolkit==2.0.10
  ptyprocess==0.6.0
  pycparser==2.20
  pyfolio==0.9.2
  Pygments==2.6.1
  pyOpenSSL==18.0.0
  pyparsing==2.4.6
  pyportfolioopt==0.5.3
  PySocks==1.6.8
  python-dateutil==2.8.1
  python-editor==1.0.4
  pytz==2019.3
  pywin32==227
  pywinpty==0.5.4
  pyzmq==17.1.2
  qtconsole==4.7.1
  QtPy==1.9.0
  requests==2.20.1
  requests-file==1.4.3
  requests-ftp==0.3.1
  scikit-learn==0.22.2.post1
  scipy==1.1.0
  seaborn==0.9.0
  Send2Trash==1.5.0
  simplegeneric==0.8.1
  sip==4.19.12
  six==1.11.0
  sortedcontainers==2.1.0
  SQLAlchemy==1.2.12
  statsmodels==0.9.0
  TA-Lib==0.4.9
  tables==3.4.4
  terminado==0.8.1
  testpath==0.4.4
  toolz==0.10.0
  tornado==5.1.1
  trading-calendars==1.11.2
  traitlets==4.3.2
  urllib3==1.23
  wcwidth==0.1.8
  webencodings==0.5.1
  widgetsnbextension==3.4.2
  win-inet-pton==1.0.1
  win-unicode-console==0.5
  wincertstore==0.2
  wrapt==1.10.11
  zipline==1.3.0

Now that you know a little about me, let me tell you about the issue I am having: I am getting the following error while creating a custom bundle from CSV files:


Loading 3MINDIA.NS...
Loading AARTIIND.NS...
Loading AAVAS.NS...
Loading ABBOTINDIA.NS...
Loading ABCAPITAL.NS...
Loading ABFRL.NS...
Loading ACC.NS...
Loading ADANIGAS.NS...
Loading ADANIGREEN.NS...
Loading ADANIPORTS.NS...
Loading ADANIPOWER.NS...
Loading ADANITRANS.NS...
Loading ADVENZYMES.NS...
Loading AEGISCHEM.NS...
Loading AFFLE.NS...
Loading AIAENG.NS...
Loading AJANTPHARM.NS...
Loading AKZOINDIA.NS...
Loading ALKEM.NS...
Loading ALLCARGO.NS...
Loading AMARAJABAT.NS...
Loading AMBER.NS...
Loading AMBUJACEM.NS...
Loading APLAPOLLO.NS...
Loading APLLTD.NS...
Loading APOLLOHOSP.NS...
Loading APOLLOTYRE.NS...
Loading ARVINDFASN.NS...
Loading ASAHIINDIA.NS...
Loading ASHOKA.NS...
Loading ASHOKLEY.NS...
Loading ASIANPAINT.NS...
Loading ASTERDM.NS...
Loading ASTRAL.NS...
Loading ASTRAZEN.NS...
Loading ATUL.NS...
Loading AUBANK.NS...
Loading AUROPHARMA.NS...
Loading AVANTIFEED.NS...
Loading AXISBANK.NS...
Loading BAJAJ-AUTO.NS...
Loading BAJAJCON.NS...
Loading BAJAJELEC.NS...
Loading BAJAJFINSV.NS...
Loading BAJAJHLDNG.NS...
Loading BAJFINANCE.NS...
Loading BALKRISIND.NS...
Loading BALMLAWRIE.NS...
Loading BALRAMCHIN.NS...
Loading BANDHANBNK.NS...
Loading BANKBARODA.NS...
Loading BANKINDIA.NS...
Loading BASF.NS...
Loading BATAINDIA.NS...
Loading BAYERCROP.NS...
Loading BBTC.NS...
Loading BDL.NS...
Loading BEL.NS...
Loading BEML.NS...
Loading BERGEPAINT.NS...
Loading BHARATFORG.NS...
Loading BHARTIARTL.NS...
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\env_zipline\Scripts\zipline-script.py", line 11, in <module>
    load_entry_point('zipline==1.3.0', 'console_scripts', 'zipline')()
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\click\core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\__main__.py", line 348, in ingest
    show_progress,
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\data\bundles\core.py", line 451, in ingest
    pth.data_path([name, timestr], environ=environ),
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\data\bundles\india_stock_data.py", line 64, in india_stock_data
    process_stocks(symbols, sessions, metadata, divs)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\data\us_equity_pricing.py", line 257, in write
    return self._write_internal(it, assets)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\data\us_equity_pricing.py", line 319, in _write_internal
    for asset_id, table in iterator:
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\data\us_equity_pricing.py", line 248, in <genexpr>
    (sid, self.to_ctable(df, invalid_data_behavior))
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\data\bundles\india_stock_data.py", line 93, in process_stocks
    df = df.reindex(sessions.tz_localize(None))[start_date:end_date]
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\util\_decorators.py", line 127, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\frame.py", line 2935, in reindex
    return super(DataFrame, self).reindex(**kwargs)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\generic.py", line 3023, in reindex
    fill_value, copy).__finalize__(self)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\frame.py", line 2870, in _reindex_axes
    fill_value, limit, tolerance)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\frame.py", line 2881, in _reindex_index
    allow_dups=False)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\generic.py", line 3145, in _reindex_with_indexers
    copy=copy)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\internals.py", line 4139, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\indexes\base.py", line 2944, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")

ValueError: cannot reindex from a duplicate axis
* What did you expect to happen?
* What happened instead?
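
The `ValueError` at the end of the traceback above is what pandas raises when `reindex` is called on a frame whose index contains repeated labels, which suggests that the CSV being loaded when the error occurred may contain the same date more than once. The following is a minimal sketch, with made-up prices, of reproducing the error and dropping duplicate dates before reindexing. The helper name `drop_duplicate_dates` is hypothetical, and keeping the last row for each date is an assumption; it is not part of Zipline or of the bundle code in this issue.

```python
import pandas as pd


def drop_duplicate_dates(df):
    """Keep one row per index label (the last one seen), sorted by date.

    Hypothetical helper for illustration only.
    """
    return df[~df.index.duplicated(keep="last")].sort_index()


# Made-up prices in which 2020-01-02 appears twice, as a bad CSV might.
raw = pd.DataFrame(
    {"close": [10.0, 10.5, 10.6, 11.0]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"]
    ),
)

# Stand-in for the exchange sessions used in the bundle code.
sessions = pd.date_range("2020-01-01", "2020-01-03", freq="D")

# Dates that occur more than once in the raw data.
duplicate_dates = raw.index[raw.index.duplicated(keep=False)].unique()

try:
    raw.reindex(sessions)
    reindex_failed = False
except ValueError:
    # Same class of error as in the traceback above.
    reindex_failed = True

# After removing the repeated date, reindexing succeeds.
clean = drop_duplicate_dates(raw).reindex(sessions)
```

Printing `duplicate_dates` for each symbol before the `reindex` call would be one way to find which files need cleaning, rather than removing those files from the bundle altogether.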

Here is how you can reproduce this issue on your machine:

## Reproduction Steps

1. 
2.
3.
na
## What steps have you taken to resolve this already?

NA

# Anything else?

NA

Sincerely,
Sabir Jana

sabirjana commented 4 years ago

Hi, I removed the CSV files that were causing problems and was able to create the bundle. However, I am not able to use it because of the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 852076800000000000

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-27-d3d0e133d2f3> in <module>
      7     capital_base=100000, # Set initial capital
      8     data_frequency = 'daily',  # Set data frequency
----> 9     bundle= 'india_stock_data' )#'random_equities') #'india_stock_data' )#'quandl') #'ac_equities_db' ) # Select bundle

C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\utils\run_algo.py in run_algorithm(start, end, initialize, capital_base, handle_data, before_trading_start, analyze, data_frequency, data, bundle, bundle_timestamp, trading_calendar, metrics_set, default_extension, extensions, strict_extensions, environ, blotter)
    428         local_namespace=False,
    429         environ=environ,
--> 430         blotter=blotter,
    431     )

C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\utils\run_algo.py in _run(handle_data, initialize, before_trading_start, analyze, algofile, algotext, defines, data_frequency, capital_base, data, bundle, bundle_timestamp, start, end, output, trading_calendar, print_algo, metrics_set, local_namespace, environ, blotter)
    167             equity_minute_reader=bundle_data.equity_minute_bar_reader,
    168             equity_daily_reader=bundle_data.equity_daily_bar_reader,
--> 169             adjustment_reader=bundle_data.adjustment_reader,
    170         )
    171 

C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\zipline\data\data_portal.py in __init__(self, asset_finder, trading_calendar, first_trading_day, equity_daily_reader, equity_minute_reader, future_daily_reader, future_minute_reader, adjustment_reader, last_available_session, last_available_minute, minute_history_prefetch_length, daily_history_prefetch_length)
    289                 self._first_trading_day
    290             )
--> 291             if self._first_trading_day is not None else (None, None)
    292         )
    293 

C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\trading_calendars\trading_calendar.py in open_and_close_for_session(self, session_label)
    763         # http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#datetime-with-tz  # noqa
    764         return (
--> 765             sched.at[session_label, 'market_open'].tz_localize(UTC),
    766             sched.at[session_label, 'market_close'].tz_localize(UTC),
    767         )

C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1867 
   1868         key = self._convert_key(key)
-> 1869         return self.obj._get_value(*key, takeable=self._takeable)
   1870 
   1871     def __setitem__(self, key, value):

C:\ProgramData\Anaconda3\envs\env_zipline\lib\site-packages\pandas\core\frame.py in _get_value(self, index, col, takeable)
   1983 
   1984         try:
-> 1985             return engine.get_value(series._values, index)
   1986         except (TypeError, ValueError):
   1987 

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

KeyError: Timestamp('1997-01-01 00:00:00+0000', tz='UTC')

My extension.py looks like this:

from zipline.data.bundles import register, india_stock_data

register(
    'india_stock_data',
    india_stock_data.india_stock_data,
    calendar_name='XBOM'
)   
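
The bundle is registered against `XBOM`, while the `KeyError` above is raised when a trading calendar is asked for the open and close of 1997-01-01. One possible reading is that the calendar used at run time differs from the one used at ingest time. The sketch below uses pandas only: it decodes the integer from the first `KeyError` and checks the resulting date against two stand-in session lists. Both lists are assumptions for illustration (plain weekdays, and weekdays minus US federal holidays); they are not the real `trading_calendars` data.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# The integer from the first KeyError, read as nanoseconds since the epoch.
# The result is a naive timestamp that represents a UTC instant.
first_day = pd.Timestamp(852076800000000000)

# Stand-in session lists (assumptions, not real exchange calendars).
weekdays = pd.bdate_range("1996-12-30", "1997-01-10")
us_like = pd.date_range(
    "1996-12-30",
    "1997-01-10",
    freq=CustomBusinessDay(calendar=USFederalHolidayCalendar()),
)

# A weekday-only calendar contains the date; a calendar that observes
# New Year's Day does not, so a lookup there would fail with a KeyError.
in_weekdays = first_day in weekdays
in_us_like = first_day in us_like
```

The integer decodes to 1997-01-01, the same date as in the final `KeyError`. The `run_algorithm` signature shown in the traceback includes a `trading_calendar` parameter, so if the call did not pass one, supplying a calendar that matches the bundle (for example one obtained from `trading_calendars.get_calendar('XBOM')`) may be worth trying.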

The india_stock_data.py code is as follows:

import pandas as pd
from os import listdir

# Change the path to where you have your data
path = 'C:\\Users\\sabirj\\Desktop\\P4Finance\\data'

"""
The ingest function needs to have this exact signature,
meaning these arguments passed, as shown below.
"""
def india_stock_data(environ,
                  asset_db_writer,
                  minute_bar_writer,
                  daily_bar_writer,
                  adjustment_writer,
                  calendar,
                  start_session,
                  end_session,
                  cache,
                  show_progress,
                  output_dir):

    # Get list of files from path
    # Slicing off the last part
    # 'example.csv'[:-4] = 'example'
    symbols = [f[:-4] for f in listdir(path)]

    if not symbols:
        raise ValueError("No symbols found in folder.")

    # Prepare an empty DataFrame for dividends
    divs = pd.DataFrame(columns=['sid', 
                                 'amount',
                                 'ex_date', 
                                 'record_date',
                                 'declared_date', 
                                 'pay_date']
    )

    # Prepare an empty DataFrame for splits
    splits = pd.DataFrame(columns=['sid',
                                   'ratio',
                                   'effective_date']
    )

    # Prepare an empty DataFrame for metadata
    metadata = pd.DataFrame(columns=('start_date',
                                              'end_date',
                                              'auto_close_date',
                                              'symbol',
                                              'exchange'
                                              )
                                     )

    # Check valid trading dates, according to the selected exchange calendar
    sessions = calendar.sessions_in_range(start_session, end_session)

    # Get data for all stocks and write to Zipline
    daily_bar_writer.write(
            process_stocks(symbols, sessions, metadata, divs)
            )

    # Write the metadata
    asset_db_writer.write(equities=metadata)

    # Write splits and dividends
    adjustment_writer.write(splits=splits,
                            dividends=divs)    

"""
Generator function to iterate stocks,
build historical data, metadata 
and dividend data
"""
def process_stocks(symbols, sessions, metadata, divs):
    # Loop the stocks, setting a unique Security ID (SID)
    for sid, symbol in enumerate(symbols):

        print('Loading {}...'.format(symbol))
        # Read the stock data from csv file.
        df = pd.read_csv('{}/{}.csv'.format(path, symbol), index_col=[0], parse_dates=[0]) 

        # Check first and last date.
        start_date = df.index[0]
        end_date = df.index[-1]        

        # Synch to the official exchange calendar
        df = df.reindex(sessions.tz_localize(None))[start_date:end_date] #tz_localize(None)

        # Forward fill missing data
        df.fillna(method='ffill', inplace=True)

        # Drop remaining NaN
        df.dropna(inplace=True)    

        # The auto_close date is the day after the last trade.
        ac_date = end_date + pd.Timedelta(days=1)

        # Add a row to the metadata DataFrame. Don't forget to add an exchange field.
        metadata.loc[sid] = start_date, end_date, ac_date, symbol, "XBOM"

        # If there's dividend data, add that to the dividend DataFrame
        if 'dividend' in df.columns:

            # Slice off the days with dividends
            tmp = df[df['dividend'] != 0.0]['dividend']
            div = pd.DataFrame(data=tmp.index.tolist(), columns=['ex_date'])

            # Provide empty columns as we don't have this data for now
            div['record_date'] = pd.NaT
            div['declared_date'] = pd.NaT
            div['pay_date'] = pd.NaT            

            # Store the dividends and set the Security ID
            div['amount'] = tmp.tolist()
            div['sid'] = sid

            # Start numbering at where we left off last time
            ind = pd.Index(range(divs.shape[0], divs.shape[0] + div.shape[0]))
            div.set_index(ind, inplace=True)

            # Append this stock's dividends to the list of all dividends
            divs = divs.append(div)    

        yield sid, df
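
Separately from the two errors reported, one detail in `process_stocks` may be worth checking: `divs = divs.append(div)` rebinds the local name inside the generator, so the `divs` DataFrame that the ingest function later passes to `adjustment_writer.write` would appear to stay empty. A minimal sketch of one alternative, with made-up data, is to collect the per-symbol frames in a list owned by the caller and concatenate them once the generator has been consumed. The names `process` and `div_frames` are hypothetical.

```python
import pandas as pd


def process(symbols, div_frames):
    """Yield (sid, symbol) and record one dividend frame per symbol.

    Appending to the list mutates the caller's object, unlike rebinding
    a DataFrame name inside the generator.
    """
    for sid, symbol in enumerate(symbols):
        # Made-up dividend row standing in for real per-symbol data.
        div = pd.DataFrame({"sid": [sid], "amount": [1.0]})
        div_frames.append(div)
        yield sid, symbol


div_frames = []

# Consuming the generator plays the role of daily_bar_writer.write(...).
consumed = list(process(["AAA", "BBB"], div_frames))

# Only after the generator is exhausted are all dividend frames available.
divs = pd.concat(div_frames, ignore_index=True)
```

With this shape, the concatenated `divs` would be built after the daily bars are written and before the adjustments are written.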