scrtlabs / catalyst

An Algorithmic Trading Library for Crypto-Assets in Python
http://enigma.co
Apache License 2.0

Ingest Data Errors when specifying early starting date. #47

Closed: avn3r closed this issue 6 years ago

avn3r commented 6 years ago

Dear Catalyst Maintainers,

Before I tell you about my issue, let me describe my environment:

Environment

Description of Issue

  1. Ingested data does not load all markets when selecting an early starting date like 2017-01-01. If I select 2017-09-01, all markets work as expected. I have manually filtered out pairs that are not in the current range of the backtest, to ensure that data for that day existed in Catalyst.

The BTC market works for both starting dates, but the XMR, ETH, and USDT markets are not working for 2017-01-01, similar to the previous issues in v0.3.2.

I have only tested on Poloniex, but my script can be used with any exchange by changing context.exchange, so feel free to use this script as a unit test.

Code

"""
Requires Catalyst version 0.3.0 or above
Tested on Catalyst version 0.3.2

This example aims to provide an easy way for users to learn how to collect data from the different exchanges.
You simply need to specify the exchange and the market that you want to focus on.
You will also see how to create a universe and filter it based on the exchange and the market you desire.

The example prints the closing price of all the pairs for a given market-exchange every 30 minutes.
The example also retrieves the OHLCV minute data for the past seven days, which could be used to create indicators.
Use this as the backbone to create your own trading strategies.

The variables lookback_date and date are used to ensure that data for a coin existed over the lookback period specified.
"""

import numpy as np
import pandas as pd
from datetime import timedelta
from catalyst import run_algorithm
from catalyst.exchange.exchange_utils import get_exchange_symbols

from catalyst.api import (
    symbols,
)

def initialize(context):
    context.i = -1  # counts the minutes
    context.exchange = 'poloniex'  # must match the exchange specified in run_algorithm
    context.base_currency = 'eth'  # must match the base currency specified in run_algorithm

def handle_data(context, data):
    lookback = 60 * 24 * 7  # how far to look back in the data history: 60 min * 24 h * 7 days of minute bars
    context.i += 1

    # current date formatted into a string
    today = context.blotter.current_dt
    date, time = today.strftime('%Y-%m-%d %H:%M:%S').split(' ')
    lookback_date = today - timedelta(days=(lookback / (60 * 24)))  # subtract the amount of days specified in lookback
    lookback_date = lookback_date.strftime('%Y-%m-%d %H:%M:%S').split(' ')[0]  # get only the date as a string

    # update universe everyday
    new_day = 60 * 24
    if not context.i % new_day:
        context.universe = universe(context, lookback_date, date)

    # get data every 30 minutes
    minutes = 30
    if not context.i % minutes and context.universe:
        # we iterate for every pair in the current universe
        for coin in context.coins:
            pair = str(coin.symbol)

            # 30-minute interval OHLCV data (the standard data required for candlesticks or indicators/signals)
            # '30T' means resampling one-minute data into 30-minute bars; change to your desired time interval.
            open = fill(data.history(coin, 'open', bar_count=lookback, frequency='1m')).resample('30T').first()
            high = fill(data.history(coin, 'high', bar_count=lookback, frequency='1m')).resample('30T').max()
            low = fill(data.history(coin, 'low', bar_count=lookback, frequency='1m')).resample('30T').min()
            close = fill(data.history(coin, 'price', bar_count=lookback, frequency='1m')).resample('30T').last()
            volume = fill(data.history(coin, 'volume', bar_count=lookback, frequency='1m')).resample('30T').sum()

            # close[-1] is the equivalent of the current price
            # displays the minute price for each pair every 30 minutes
            print(today, pair, open[-1], high[-1], low[-1], close[-1], volume[-1])

            # ----------------------------------------------------------------------------------------------------------
            # -------------------------------------- Insert Your Strategy Here -----------------------------------------
            # ----------------------------------------------------------------------------------------------------------

def analyze(context=None, results=None):
    pass

# Get the universe for a given exchange and a given base_currency market
# Example: Poloniex BTC Market
def universe(context, lookback_date, current_date):
    json_symbols = get_exchange_symbols(context.exchange)  # get all the pairs for the exchange
    universe_df = pd.DataFrame.from_dict(json_symbols).transpose().astype(str)  # convert into a dataframe
    universe_df['base_currency'] = universe_df.apply(
        lambda row: row.symbol.split('_')[1], axis=1)
    universe_df['market_currency'] = universe_df.apply(
        lambda row: row.symbol.split('_')[0], axis=1)
    # Filter all the exchange pairs to only the ones for a given base currency
    universe_df = universe_df[universe_df['base_currency'] == context.base_currency]

    # Filter all the pairs to ensure that each pair existed over the current date range.
    # The dates are ISO-formatted strings, so lexicographic comparison works here.
    universe_df = universe_df[universe_df.start_date < lookback_date]
    universe_df = universe_df[universe_df.end_daily >= current_date]
    context.coins = symbols(*universe_df.symbol)  # convert all the pairs to symbols
    print(universe_df.head(), len(universe_df))
    return universe_df.symbol.tolist()

# Replace all NA, NaN or infinite values with the nearest valid value
def fill(series):
    if isinstance(series, pd.Series):
        return series.replace([np.inf, -np.inf], np.nan).ffill().bfill()
    elif isinstance(series, np.ndarray):
        return pd.Series(series).replace([np.inf, -np.inf], np.nan).ffill().bfill().values
    else:
        return series

if __name__ == '__main__':
    start_date = pd.to_datetime('2017-01-01', utc=True)
    end_date = pd.to_datetime('2017-10-15', utc=True)

    performance = run_algorithm(start=start_date, end=end_date,
                                capital_base=10000.0,
                                initialize=initialize,
                                handle_data=handle_data,
                                analyze=analyze,
                                exchange_name='poloniex',
                                data_frequency='minute',
                                base_currency='eth',
                                live=False,
                                live_graph=False,
                                algo_namespace='simple_universe')

"""
Run in Terminal (inside catalyst environment):
python simple_universe.py
"""

Error

[2017-10-26 23:14:30.107384] INFO: exchange_bundle: pricing data for [u'bcn_xmr'] not found in range 1989-05-28 00:00:00+00:00 to 2017-01-01 00:00:00+00:00, updating the bundles.
    [====================================]  Fetching poloniex daily candles: :  100%
Error traceback: /home/avn3r/applications/conda/envs/catalyst/lib/python2.7/site-packages/catalyst/exchange/exchange_bundle.py (line 676)
PricingDataNotLoadedError:  Pricing data open for trading pairs bcn_xmr trading on exchange poloniex since 2014-07-23 00:00:00+00:00 is unavailable. The bundle data is either out-of-date or has not been loaded yet. Please ingest data using the command `catalyst ingest-exchange -x poloniex -f daily -i bcn_xmr`. See catalyst documentation for details.
Sincerely,

avn3r

fredfortier commented 6 years ago

I was able to reproduce this issue. Basically, using Jan 1st as the start date looks for data in the previous period; using Jan 2nd does not. Working on a fix.

fredfortier commented 6 years ago

Here is at least part of the issue. Even though your algo starts on 2017-01-01, you are calling data.history(), which requests a number of bars back from the current date. Your bar_count is 10080, which explains why the algo would attempt to retrieve 2016 data. Note that when you don't specify a data_frequency parameter, it uses the data frequency of your algo, minute in this case.

Now, that's an explanation of why it fetches data in 2016, not why it fails to obtain it. I'm investigating this further.
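The window arithmetic described above can be checked with plain pandas (an illustration of the dates involved, not Catalyst's internal code): a bar_count of 10080 one-minute bars from a 2017-01-01 start reaches back into 2016.

```python
import pandas as pd

start = pd.Timestamp('2017-01-01', tz='UTC')  # algo start date from the script
bar_count = 60 * 24 * 7                       # 10080 one-minute bars (7 days)

# data.history() needs minute bars back to this timestamp, i.e. into 2016
earliest_bar = start - pd.Timedelta(minutes=bar_count)
print(earliest_bar)  # 2016-12-25 00:00:00+00:00
```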

avn3r commented 6 years ago

I don't understand what you mention about the bar_count parameter not being specified. I clearly specified bar_count=lookback, frequency='1m', and data_frequency='minute', where lookback is 7 days of history. So yes, I look at the last 7 days of 2016 to make predictions for 2017-01-01, but it should still be able to get 2016 data, because I ensure that only pairs that existed at least 7 days before the current date are part of the universe.

avn3r commented 6 years ago

However, as you mention, the error has to do with retrieving 2016 data. I confirmed that all Poloniex markets work if I specify 2017-01-08 as my starting date, so that all my history data is in 2017.

fredfortier commented 6 years ago

"Not understanding what you mention of bar_count parameter not being specified. " - I apologize, I modified my comment after looking more into your code.

avn3r commented 6 years ago

Given your feedback, I saw I was specifying the wrong range for catalyst ingest.

If I manually ingest: catalyst ingest-exchange -x poloniex -f minute -s 2016-12-01

It now works. So the issue can be narrowed down to when data is automatically ingested.

fredfortier commented 6 years ago

I'm not sure that I have fixed this issue yet, but I'm making the following change: when requesting data.history(), I will modify the range to use the end date of the algo. This ensures that the algo retrieves historical data only once per market.

fredfortier commented 6 years ago

This should be fixed. Here is what it looks like on my side: https://www.dropbox.com/s/vh6h2digwbxy4h7/issue_47.mp4?dl=0

Feel free to re-open if you are still experiencing issues with release 0.3.4.

fredfortier commented 6 years ago

Re-opening to give enough time for validation.

avn3r commented 6 years ago

OK, I will confirm once 0.3.4 is out.

avn3r commented 6 years ago

It did not pass my validation.

I PMed you with details, but basically, if I specify -s 2017-01-01 during manual ingestion it works, since my start_date=2017-01-08. However, a new bug occurs if I select start_date=2016-11-01: it forgets to fetch the older data that I have not manually ingested yet. Since it couldn't find the older data, it prints all NaN values.

fredfortier commented 6 years ago

I'm not sure if it's the same issue, but I did find something. I came across a use case where it was periodically trying to ingest data for some bars even after running the same algo multiple times.

Here is the error message:

handling bar: 2017-06-01 23:59:00+00:00
got price 0.091554
[2017-11-03 23:49:41.897220] INFO: exchange_bundle: pricing data for [u'eth_btc'] not found in range 2017-05-31 22:40:00+00:00 to 2017-06-01 00:00:00+00:00, updating the bundles.
    [====================================]  Ingesting minute price data for eth_btc on bitfinex:  100%
Pricing data close for trading pairs eth_btc trading on exchange bitfinex since 2016-03-09 00:00:00+00:00 is unavailable. The bundle data is either out-of-date or has not been loaded yet. Please ingest data using the command `catalyst ingest-exchange -x bitfinex -f minute -i eth_btc`. See catalyst documentation for details.

I dumped the bundle into a CSV file and found this:

timestamp (UTC) | open | high | low | close | volume
-- | -- | -- | -- | -- | --
2017-05-31 23:56:00+00:00 | 0.050334 | 0.050334 | 0.050334 | 0.050334 | 0
2017-05-31 23:57:00+00:00 | 0.050334 | 0.050334 | 0.050334 | 0.050334 | 0
2017-05-31 23:58:00+00:00 | 0.050334 | 0.050334 | 0.050334 | 0.050334 | 0
2017-05-31 23:59:00+00:00 | 0.050334 | 0.050334 | 0.050334 | 0.050334 | 0
2017-06-01 00:00:00+00:00 |   |   |   |   | 0
2017-06-01 00:01:00+00:00 | 0.09981799 | 0.09981799 | 0.09981799 | 0.09981799 | 2.25752429
2017-06-01 00:02:00+00:00 | 0.09981799 | 0.09981799 | 0.09981799 | 0.09981799 | 0

There is an empty row on 2017-06-01, which seems to explain the error. I'm investigating this now to determine the root cause.
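A gap like the one in the dump above can be detected mechanically. This hypothetical pandas check (the column names are assumptions, matching the dump's layout) flags minute rows whose OHLC fields are all empty:

```python
import numpy as np
import pandas as pd

# Reconstruct a slice of the dumped bundle around the gap
df = pd.DataFrame(
    {'open':  [0.050334, np.nan, 0.09981799],
     'high':  [0.050334, np.nan, 0.09981799],
     'low':   [0.050334, np.nan, 0.09981799],
     'close': [0.050334, np.nan, 0.09981799]},
    index=pd.to_datetime(
        ['2017-05-31 23:59', '2017-06-01 00:00', '2017-06-01 00:01'], utc=True))

# Rows where every OHLC column is missing mark holes in the bundle
gaps = df[df[['open', 'high', 'low', 'close']].isna().all(axis=1)]
print(gaps.index.tolist())  # just the empty 2017-06-01 00:00 bar
```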

fredfortier commented 6 years ago

I was able to simulate a behavior similar to what's described here by starting an ingest-exchange job, killing it halfway and then attempting auto-ingestion.

avn3r commented 6 years ago

The first one you mentioned is with respect to Bitfinex. I got the same error, but it's not the one I reported. I have only gotten data to load properly on Poloniex.

I did kill it the first time, but I made sure to clean it and re-download. I tested it on the Poloniex BTC market. I ingested data from 2017-01-01 to 2017-10-16 with a start date of 2017-01-08 in the script, and it works as expected. But changing the start date to 2016-11-01 throws all NaN and doesn't try to ingest that data as expected.

fredfortier commented 6 years ago

I believe that this issue is still reproducible under some auto-ingest conditions. We've prioritized this and are working towards a resolution now.

avn3r commented 6 years ago

Yes, I am still able to reproduce both errors discussed, on v0.3.6:

1) Bitfinex minute data for eth_btc produces an ingestion error. This error occurs even when manually ingesting the data, so Bitfinex is still not working as expected since 0.3.x.

2) Manually ingesting data but requesting even earlier data. Example:

catalyst clean-exchange -x poloniex
catalyst ingest-exchange -x poloniex -f minute -s 2017-01-01 -e 2017-10-30

Then run simple_universe.py (my script in the PR):
Range: start_date=2017-01-08, end_date=2017-10-15 <-- This works
Range: start_date=2016-12-30, end_date=2017-10-15 <-- Gives Error

Error: All NaN values are printed. The data didn't exist, and Catalyst should have tried to ingest it, but never did.

fredfortier commented 6 years ago

I just noticed an important detail in your earlier comment:

Bitfinex minute data eth_btc producing ingestion error.

How do you reproduce this particular condition? This command seems to work well for me:

catalyst ingest-exchange -x bitfinex -i eth_btc -f minute 

It's possible that the issue was resolved by changes to the bundle related to issue #54.

I'm investigating the other conditions.

fredfortier commented 6 years ago

I made two adjustments which seem to address NaN issues when auto-ingesting on top of partially available data:

  1. Re-instantiate the bcolz bar reader: the bar reader object keeps some data in memory, which seemed to be a problem only when ingesting data for currency pairs already partially in the main bundle.
  2. Re-download bad temp bundles: I have noticed that aborting an ingestion job sometimes results in a bad bundle in the temp_bundles folder. We now anticipate this condition and replace the bundle when needed.

I'm now investigating this condition more closely: "Manually ingesting data but requesting even earlier data".

avn3r commented 6 years ago

With respect to error one, eth_btc, I just meant the error you reported.

You:

> I'm not sure if it's the same issue, but I did find something. I came across a use case where it was periodically trying to ingest data for some bars even after running the same algo multiple times. [...] There is an empty row on 2017-06-01 which seems to explain the error.

fredfortier commented 6 years ago

Waiting for #54 and #53 to complete validation.

fredfortier commented 6 years ago

After even more testing, I still observed instances of NaN entries during auto-ingestion. I believe that it's a caching issue between the writer and reader, but it's hard to pinpoint the exact root cause. We may consider disabling auto-ingestion temporarily, as these issues don't seem to occur when populating the bundles separately.

avn3r commented 6 years ago

No worries.

Just make sure to cover manual ingestion in the documentation, in both the Installation section and the beginner tutorial, so people know they first have to ingest data and how they should ingest it. The current documentation doesn't say much about manual ingestion besides the parameters available.

avn3r commented 6 years ago

...

avn3r commented 6 years ago

It seems all errors have been fixed with 0.3.8+. Feel free to go ahead and close this issue.

lenak25 commented 6 years ago

@abnera , closing this issue with regards to your last comment.