quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

Issue ingesting daily and minute data: sqlite3.IntegrityError: UNIQUE constraint failed: equities.sid #2376

Open brettelliot opened 5 years ago

brettelliot commented 5 years ago

Dear Zipline Maintainers,

Before I tell you about my issue, let me describe my environment:

Environment

* Operating System: macOS 10.14.1
* Python Version: 3.5.6
* How did you install Zipline: pip
* Python packages: alembic==1.0.3 appnope==0.1.0 backcall==0.1.0 bcolz==0.12.1 bleach==3.0.2 Bottleneck==1.2.1 certifi==2018.10.15 chardet==3.0.4 Click==7.0 contextlib2==0.5.5 cycler==0.10.0 cyordereddict==1.0.0 Cython==0.29.1 decorator==4.3.0 defusedxml==0.5.0 empyrical==0.5.0 entrypoints==0.2.3 idna==2.7 intervaltree==2.1.0 ipykernel==5.1.0 ipython==7.1.1 ipython-genutils==0.2.0 ipywidgets==7.4.2 jedi==0.13.1 Jinja2==2.10 jsonschema==2.6.0 jupyter==1.0.0 jupyter-client==5.2.3 jupyter-console==6.0.0 jupyter-core==4.4.0 kiwisolver==1.0.1 Logbook==1.4.1 lru-dict==1.1.6 lxml==4.2.5 Mako==1.0.7 MarkupSafe==1.1.0 matplotlib==3.0.2 mistune==0.8.4 multipledispatch==0.6.0 nbconvert==5.4.0 nbformat==4.4.0 networkx==1.11 notebook==5.7.2 numexpr==2.6.8 numpy==1.15.4 pandas==0.22.0 pandas-datareader==0.7.0 pandocfilters==1.4.2 parso==0.3.1 patsy==0.5.1 pexpect==4.6.0 pickleshare==0.7.5 Pillow==5.3.0 prometheus-client==0.4.2 prompt-toolkit==2.0.7 ptyprocess==0.6.0 pyEX==0.1.5 Pygments==2.3.0 pyparsing==2.3.0 python-dateutil==2.7.5 python-editor==1.0.3 pytz==2018.7 pyzmq==17.1.2 qtconsole==4.4.3 requests==2.20.1 requests-file==1.4.3 scipy==1.1.0 Send2Trash==1.5.0 six==1.11.0 socketIO-client-nexus==0.7.6 sortedcontainers==2.1.0 SQLAlchemy==1.2.14 statsmodels==0.9.0 tables==3.4.4 terminado==0.8.1 testpath==0.4.2 toolz==0.9.0 tornado==5.1.1 trading-calendars==1.5.1 traitlets==4.3.2 urllib3==1.24.1 wcwidth==0.1.7 webencodings==0.5.1 websocket-client==0.54.0 widgetsnbextension==3.4.2 wrapt==1.10.11 zipline==1.3.0

Now that you know a little about me, let me tell you about the issue I am having:

Description of Issue

sqlite3.IntegrityError: UNIQUE constraint failed: equities.sid

Here is how you can reproduce this issue on your machine:

Reproduction Steps

  1. Create a directory structure that has some daily and minute data from IEX. This data is attached here: iex.zip
./iex
./iex/minute
./iex/minute/A.csv
./iex/minute/AADR.csv
./iex/minute/AAC.csv
./iex/minute/AABA.csv
./iex/minute/AAMC.csv
./iex/minute/AAAU.csv
./iex/minute/AAN.csv
./iex/minute/AAME.csv
./iex/minute/AAL.csv
./iex/minute/AA.csv
./iex/daily
./iex/daily/A.csv
./iex/daily/AADR.csv
./iex/daily/AAC.csv
./iex/daily/AABA.csv
./iex/daily/AAMC.csv
./iex/daily/AAAU.csv
./iex/daily/AAN.csv
./iex/daily/AAME.csv
./iex/daily/AAL.csv
./iex/daily/AA.csv
  2. Put these lines into ~/.zipline/extension.py:
    from zipline.data.bundles import register, csvdir
    register('my-iex-csvdir', csvdir.csvdir_equities(['daily', 'minute']))
  3. Attempt to ingest:

CSVDIR=./iex/ zipline ingest -b my-iex-csvdir ...

What steps have you taken to resolve this already?

I can successfully ingest the daily data by itself with this extension.py:

register('my-iex-csvdir', csvdir.csvdir_equities(['daily']))

and I can successfully ingest the minute data by itself with this extension.py:

register('my-iex-csvdir', csvdir.csvdir_equities(['minute']))

But those changes don't fix anything; they just prove that my data can be ingested individually. Either there is a problem with the csvdir bundle when trying to ingest daily and minute data together, or there's a problem with my data, process, extension.py, etc. that I just don't see. ...

Anything else?

Thanks for the support!

Here's the whole call stack:

(maroma-e0011) ➜  e0011 git:(be-feature) ✗ CSVDIR=../../shared_data/iex/ zipline ingest -b my-iex-csvdir
 | A: sid 0
 | AA: sid 1
 | AAAU: sid 2
 | AABA: sid 3
 | AAC: sid 4
 | AADR: sid 5
 | AAL: sid 6
 | AAMC: sid 7
 | AAME: sid 8
 | AAN: sid 9
 | A: sid 0
 | AA: sid 1
 | AAAU: sid 2
 | AABA: sid 3
Loading custom pricing data:   [##############----------------------]   40% | AAC: sid 4
Loading custom pricing data:   [##################------------------]   50% | AADR: sid 5
Loading custom pricing data:   [#####################---------------]   60% | AAL: sid 6
Loading custom pricing data:   [#########################-----------]   70%  00:00:00 | AAMC: sid 7
Loading custom pricing data:   [############################--------]   80%  00:00:00 | AAME: sid 8
Loading custom pricing data:   [################################----]   90%  00:00:00 | AAN: sid 9
Loading custom pricing data:   [####################################]  100%
Merging minute equity files:  [####################################]
Traceback (most recent call last):
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/bin/zipline", line 11, in <module>
    sys.exit(main())
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/zipline/__main__.py", line 348, in ingest
    show_progress,
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/zipline/data/bundles/core.py", line 451, in ingest
    pth.data_path([name, timestr], environ=environ),
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/zipline/data/bundles/csvdir.py", line 94, in ingest
    self.csvdir)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/zipline/data/bundles/csvdir.py", line 163, in csvdir_bundle
    asset_db_writer.write(equities=metadata)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/zipline/assets/asset_writer.py", line 507, in write
    mapping_data=data.equities_mappings,
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/zipline/assets/asset_writer.py", line 562, in _write_assets
    self._write_df_to_table(tbl, assets, txn, chunk_size)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/zipline/assets/asset_writer.py", line 529, in _write_df_to_table
    chunksize=chunk_size,
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/pandas/core/generic.py", line 1534, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/pandas/io/sql.py", line 473, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/pandas/io/sql.py", line 1512, in to_sql
    table.insert(chunksize)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/pandas/io/sql.py", line 670, in insert
    self._execute_insert(conn, keys, chunk_iter)
  File "/Users/brettelliot/.virtualenvs/maroma-e0011/lib/python3.5/site-packages/pandas/io/sql.py", line 1299, in _execute_insert
    conn.executemany(self.insert_statement(), data_list)
sqlite3.IntegrityError: UNIQUE constraint failed: equities.sid

...

Sincerely, $ whoami

freddiev4 commented 5 years ago

@brettelliot I can take a look at this over the weekend

brettelliot commented 5 years ago

Thank you!


brettelliot commented 5 years ago

Hi @freddiev4, did you find anything wrong with my data, or with the way I was using CSVDIR? Thanks.

freddiev4 commented 5 years ago

Hey @brettelliot, sorry I forgot to follow up here.

tl;dr you need to make two separate bundles (e.g. my-iex-csvdir-daily and my-iex-csvdir-minutely)

The reason for this is that in zipline/data/bundles/csvdir.py, around line 134, we're just looping over the tframes variable, which contains strings saying what type of data you have ('minute' and/or 'daily'). We don't create a new metadata table (in SQL) for each of the tframes, so when we try to write the metadata a second time we get `sqlite3.IntegrityError: UNIQUE constraint failed: equities.sid`, or even `sqlite3.IntegrityError: UNIQUE constraint failed: exchanges.exchange`.

In SQL, a UNIQUE integrity constraint means that every value in a column or set of columns (a key) must be unique. For example, each row in the equities table has a column called sid, and the sid for, say, A.csv is mapped to 1.

Before:

equities table
| sid | other_data | more_data | symbol |
|-----|------------|-----------|--------|
| 3   |            |           | AAL    |

This gets inserted into the table when we first build the daily data.

After:

equities table
| sid | other_data | more_data | symbol |
|-----|------------|-----------|--------|
| 3   |            |           | AAL    |
| 1   |            |           | A      |

Then, we try to build the minutely data, but we also have a file called A.csv for minutely data, for which we insert a sid of 1 into the equities table; but we already have an entry that looks exactly like that, which is where sqlite throws an exception.
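
Here's a tiny standalone illustration of that failure (plain sqlite3, not zipline's actual schema):

    import sqlite3

    # A table with a UNIQUE constraint on sid, much like the asset db's equities table.
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE equities (sid INTEGER NOT NULL, symbol TEXT, UNIQUE (sid))')

    conn.execute("INSERT INTO equities (sid, symbol) VALUES (1, 'A')")  # daily pass
    conn.execute("INSERT INTO equities (sid, symbol) VALUES (1, 'A')")  # minute pass
    # -> sqlite3.IntegrityError: UNIQUE constraint failed: equities.sid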

The workaround for this is to make two separate data bundles, one for your daily data and one for your minutely data. A single bundle cannot contain both sets of data.
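
For example, a minimal ~/.zipline/extension.py along those lines might look like this (the bundle names are just placeholders):

    from zipline.data.bundles import register, csvdir

    # Register the daily and minute CSVs as two separate bundles so each
    # ingestion writes its own asset metadata database.
    register('my-iex-csvdir-daily', csvdir.csvdir_equities(['daily']))
    register('my-iex-csvdir-minute', csvdir.csvdir_equities(['minute']))

    # Then ingest each one separately:
    #   CSVDIR=./iex/ zipline ingest -b my-iex-csvdir-daily
    #   CSVDIR=./iex/ zipline ingest -b my-iex-csvdir-minute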

If any of that is unclear, let me know!

brettelliot commented 5 years ago

Thanks Freddie,

OK, I can easily make two bundles... but can zipline load two bundles at the same time? I guess the use case is... use the history methods to access both daily and minute data.

Thanks!


brettelliot commented 5 years ago

Hi Freddie,

The problem with only loading minute data is that zipline doesn't seem to be able to handle data.history calls when the frequency is "1d". For example, this simple call crashes zipline:

daily_history = data.history(context.asset, 'price', bar_count=5, frequency="1d")

Is there a way to load two bundles at once? Is there a way to downsample minute data into daily data for the history function? (I actually downsampled the minute data to generate my daily data so I know it's easy to do. Just not sure how to do it in zipline).
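
For reference, outside of zipline the downsampling itself is straightforward with plain pandas; here is a minimal sketch on synthetic minute bars (not a zipline API):

    import pandas as pd

    # Synthetic one-day run of minute bars, standing in for one IEX minute CSV.
    idx = pd.date_range('2018-11-01 09:30', periods=390, freq='T')
    minute_bars = pd.DataFrame({'open': 100.0, 'high': 101.0, 'low': 99.0,
                                'close': 100.5, 'volume': 1000}, index=idx)

    # Downsample to daily OHLCV bars.
    daily_bars = minute_bars.resample('1D').agg({'open': 'first', 'high': 'max',
                                                 'low': 'min', 'close': 'last',
                                                 'volume': 'sum'})
    print(daily_bars)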

If none of those are possible... I've posted a question to the zipline news group asking how to get data into zipline if I already have a data API (which I do). Perhaps you know how I can skip bundles entirely. If so, please respond either here or to the Google group message. I'm really excited to backtest locally but I just can't seem to get my data in!

https://groups.google.com/d/msg/zipline/EkUf095bWs4/tMB7ejTQBgAJ

Thanks, Brett

pabx06 commented 5 years ago

zipline is dead

HamedShafiee commented 5 years ago

Hi guys, any update on this? How can we have both daily and minute data in a backtest?

laucheukhim commented 4 years ago

@brettelliot To ingest both daily and minute data, you need to edit zipline/data/bundles/csvdir.py as follows:

for i, tframe in enumerate(tframes):

    ...

    if i == 0:
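        # Write the asset metadata only on the first timeframe, not once per
        # timeframe, so the equities rows are not inserted a second time.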
        # Hardcode the exchange to "CSVDIR" for all assets and (elsewhere)
        # register "CSVDIR" to resolve to the NYSE calendar, because these
        # are all equities and thus can use the NYSE calendar.
        metadata['exchange'] = "CSVDIR"

        asset_db_writer.write(equities=metadata)

    if tframe == 'daily':
        divs_splits['divs']['sid'] = divs_splits['divs']['sid'].astype(int)
        divs_splits['splits']['sid'] = divs_splits['splits']['sid'].astype(int)
        adjustment_writer.write(splits=divs_splits['splits'],
                                dividends=divs_splits['divs'])

The SQL error is caused by multiple calls to asset_db_writer.write() when you ingest both daily and minute data. There is no need to register the metadata twice.

There is also no need to write dividends and splits twice. Since ingesting dividends from the minute data raises another error, and splits from the minute data are ignored anyway, we only need to write them from the daily data.
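
With that edit in place, the single combined registration from the top of this issue should ingest both timeframes; here is a sketch using the same names as the original report:

    # ~/.zipline/extension.py
    from zipline.data.bundles import register, csvdir

    register('my-iex-csvdir', csvdir.csvdir_equities(['daily', 'minute']))

    # Then: CSVDIR=./iex/ zipline ingest -b my-iex-csvdir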

@freddiev4 Please help fix the csvdir bundle.

psinsf commented 4 years ago

Thanks for this! I can confirm that this fix allows ingestion and access of both minute and daily data for the same asset in a single bundle.