quantopian / zipline

Zipline, a Pythonic Algorithmic Trading Library
https://www.zipline.io
Apache License 2.0

Minute Bar Ingest Failing #2344

Closed marketneutral closed 5 years ago

marketneutral commented 5 years ago

Can you provide some high level guidance and any "gotchas" you may be aware of for the ingestion of minute bar data?

TL;DR: stepping through the ingest, all bcolz steps look good, but data.current(...) produces NaN.

Input Data

I have .csv data in the form

symbol,end_time,open,high,low,close,change,settle,volume,prev_day_open_int
TUZ2018,2018-08-29 12:01:00+00:00,105.578125,105.578125,105.578125,105.578125,0.0,105.578125,1325,1325.3283333333334
TUZ2018,2018-08-29 12:02:00+00:00,105.578125,105.578125,105.578125,105.578125,0.0,105.578125,1325,1325.3283333333334
TUZ2018,2018-08-29 12:03:00+00:00,105.578125,105.578125,105.578125,105.578125,0.0,105.578125,1325,1325.3283333333334

Ingest Function

I have a working ingest function, which is registered as

register(                                                                               
    'minute',                                                                           
    csvdir_futures(                                                                     
        ['minute'],                                                                     
        MINUTE_DIR,                                                                     
        False                                                                           
    ),                                                                                  
    start_session=pd.Timestamp('2018-08-29', tz='utc'),                                 
    end_session=pd.Timestamp('2018-09-21', tz='utc'),                                   
    calendar_name='us_futures',                                                         
    minutes_per_day=1440                                                                
)

which includes

    for tframe in tframes:                                                              
        if tframe == 'minute':                                                          
            writer = minute_bar_writer                                                  
            sessions = calendar.minutes_for_sessions_in_range(                          
                start_session, end_session)                                             
        else:                                                                           
            sessions = calendar.sessions_in_range(start_session, end_session)           
            writer = daily_bar_writer                                                   

        writer.write(                                                                   
            parse_pricing_and_vol(                                                      
                raw_data,                                                               
                sessions,                                                               
                symbol_map                                                              
            ),                                                                          
            show_progress=True                                                          
        ) 

where

def parse_pricing_and_vol(data,                                                         
                          sessions,                                                     
                          symbol_map):                                                  
    import pdb; pdb.set_trace()                                                         
    for asset_id, symbol in iteritems(symbol_map):                                      
        asset_data = data.xs(                                                           
            symbol,                                                                     
            level=1                                                                     
        ).reindex(                                                                      
            sessions                                                                    
        ).fillna(0.0)                                                                   
        yield asset_id, asset_data

does indeed yield the proper ticker and asset table, inspected by pdb as

                           open  high  low  close  change  settle  volume  \
2018-08-28 22:01:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0   
2018-08-28 22:02:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0   
2018-08-28 22:03:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0   
2018-08-28 22:04:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0   
2018-08-28 22:05:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0

Don't worry about the zeros; the key point is that there is data and it is not NaN.

Write the bcolz Table

The bcolz writer here is getting a valid generator

(Pdb) data
<generator object parse_pricing_and_vol at 0x7fe3a0ed89e8>

and the generator yields good data (note that the sid is 1).

(Pdb) e
(1                            open  high  low  close  change  settle  volume  \
2018-08-28 22:01:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0   
2018-08-28 22:02:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0   
2018-08-28 22:03:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0   
2018-08-28 22:04:00+00:00   0.0   0.0  0.0    0.0     0.0     0.0     0.0

write_sid --> _write_cols

Writing the bcolz files here looks good.

(Pdb) all_minutes
DatetimeIndex(['2018-08-28 22:01:00+00:00', '2018-08-28 22:02:00+00:00',
               '2018-08-28 22:03:00+00:00', '2018-08-28 22:04:00+00:00',
               '2018-08-28 22:05:00+00:00', '2018-08-28 22:06:00+00:00',
               '2018-08-28 22:07:00+00:00', '2018-08-28 22:08:00+00:00',
               '2018-08-28 22:09:00+00:00', '2018-08-28 22:10:00+00:00',
               ...
               '2018-09-21 21:51:00+00:00', '2018-09-21 21:52:00+00:00',
               '2018-09-21 21:53:00+00:00', '2018-09-21 21:54:00+00:00',
               '2018-09-21 21:55:00+00:00', '2018-09-21 21:56:00+00:00',
               '2018-09-21 21:57:00+00:00', '2018-09-21 21:58:00+00:00',
               '2018-09-21 21:59:00+00:00', '2018-09-21 22:00:00+00:00'],
              dtype='datetime64[ns, UTC]', length=25920, freq=None)

matches

(Pdb) dts
array(['2018-08-28T22:01:00.000000000', '2018-08-28T22:02:00.000000000',
       '2018-08-28T22:03:00.000000000', ...,
       '2018-09-21T21:58:00.000000000', '2018-09-21T21:59:00.000000000',
       '2018-09-21T22:00:00.000000000'], dtype='datetime64[ns]')

and the table looks good

(Pdb) table
ctable((25920,), [('open', '<u4'), ('high', '<u4'), ('low', '<u4'), ('close', '<u4'), ('volume', '<u4')])
  nbytes: 506.25 KB; cbytes: 1.25 MB; ratio: 0.40
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := '/tmp/tmps_m3mmrp/minute/2018-10-25T19;26;03.662318/minute_equities.bcolz/00/00/000001.bcolz'
[(0, 0, 0, 0, 0) (0, 0, 0, 0, 0) (0, 0, 0, 0, 0) ..., (0, 0, 0, 0, 0)
 (0, 0, 0, 0, 0) (0, 0, 0, 0, 0)]

and the writing completes without error.

Bundle Inspection

The security master looks good:

5T20;02;45.580551$ sqlite3 assets-6.sqlite 
SQLite version 3.25.2 2018-09-25 19:08:10
Enter ".help" for usage hints.
sqlite> .tables
asset_router                   futures_contracts            
equities                       futures_exchanges            
equity_supplementary_mappings  futures_root_symbols         
equity_symbol_mappings         version_info                 
sqlite> select * from futures_contracts ;
0|FVZ2018|FV||1535630460000000000|1537480800000000000|-9223372036854775808|EXCH|1537567200000000000|1545350400000000000|1537567200000000000|1000.0|0.0001
1|TUZ2018|TU||1535544060000000000|1537480800000000000|-9223372036854775808|EXCH|1537567200000000000|1545350400000000000|1537567200000000000|2000.0|0.0001
2|TYZ2018|TY||1535716860000000000|1537480800000000000|-9223372036854775808|EXCH|1537567200000000000|1545350400000000000|1537567200000000000|1000.0|0.0001
3|USZ2018|US||1535716860000000000|1537480800000000000|-9223372036854775808|EXCH|1537567200000000000|1545350400000000000|1537567200000000000|1000.0|0.0001

The metadata.json in the zipline_root/data/minute/2018-10-25T20;02;45.580551/minute_equities.bcolz looks fine:

{"market_closes": [25593000, 25594440, 25595880, 25600200, 25601640, 25603080, 25604520, 25605960, 25610280, 25611720, 25613160, 25614600, 25616040, 25620360, 25621800, 25623240, 25624680, 25626120],
 "market_opens": [25591561, 25593001, 25594441, 25598761, 25600201, 25601641, 25603081, 25604521, 25608841, 25610281, 25611721, 25613161, 25614601, 25618921, 25620361, 25621801, 25623241, 25624681],
 "minutes_per_day": 1440, "start_session": "2018-08-29", "ohlc_ratio": 1000,
 "ohlc_ratios_per_sid": null, "first_trading_day": "2018-08-29",
 "end_session": "2018-09-21", "calendar_name": "us_futures", "version": 3}
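As a sanity check on this metadata, the market_opens and market_closes integers appear to be minutes since the Unix epoch. That reading is an assumption, but it is consistent with the first and last minutes shown in all_minutes above. A minimal sketch that decodes the first few values (the helper name minutes_to_utc is hypothetical):

```python
import pandas as pd

# First three entries copied from the metadata.json above.
market_opens = [25591561, 25593001, 25594441]
market_closes = [25593000, 25594440, 25595880]


def minutes_to_utc(values):
    """Interpret integers as minutes since 1970-01-01 UTC (an assumption)."""
    return pd.to_datetime([v * 60 for v in values], unit="s", utc=True)


opens = minutes_to_utc(market_opens)
closes = minutes_to_utc(market_closes)

# Print each session's first and last minute.
for session_open, session_close in zip(opens, closes):
    print(session_open, "->", session_close)
```

If the decoded opens and closes do not line up with the minutes being written, the calendar or the session range is a likely place to look.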

and there appears to be a table for each sid; an ls of zipline_root/data/minute/2018-10-25T20;02;45.580551/minute_equities.bcolz/00/00 gives

drwxrwxr-x 7 jlarkin jlarkin 9 Oct 25 20:02 000000.bcolz
drwxrwxr-x 7 jlarkin jlarkin 9 Oct 25 20:02 000001.bcolz
drwxrwxr-x 7 jlarkin jlarkin 9 Oct 25 20:02 000002.bcolz
drwxrwxr-x 7 jlarkin jlarkin 9 Oct 25 20:02 000003.bcolz

Accessing Minute Data in an Algo

Running the bare-minimum algo with

zipline run -f repro.py -s 2018-08-29 -e 2018-08-30 -b minute --data-frequency minute --trading-calendar us_futures

where repro.py contains

def initialize(context):
    context.my_future = future_symbol('TUZ2018')                                        
    log.info(context.my_future)                                                         

def handle_data(context, data):                                                         
    log.info(get_datetime('US/Eastern'))                                                
    log.info(data.current(context.my_future, 'close')) 

produces

[19:55:56.477316]: INFO: initialize: Future(1 [TUZ2018])
[19:55:56.489192]: INFO: handle_data: 2018-08-29 06:31:00-04:00
[19:55:56.489810]: INFO: handle_data: nan
[19:55:56.489972]: INFO: handle_data: 2018-08-29 06:32:00-04:00
[19:55:56.490135]: INFO: handle_data: nan
[19:55:56.490279]: INFO: handle_data: 2018-08-29 06:33:00-04:00
[19:55:56.490436]: INFO: handle_data: nan
[19:55:56.490579]: INFO: handle_data: 2018-08-29 06:34:00-04:00
[19:55:56.490732]: INFO: handle_data: nan

So I am getting NaN for all prices, even though it seems that, at the very least, 0.0 is there for every minute in the session.

Any pointers/guidance at all would be greatly appreciated. Thank you. 😃 📈

llllllllll commented 5 years ago

To reduce the storage size and improve compression, we actually store prices as unsigned int32 values. For all fields except for volume we multiply through by 1000 and then round. This is sufficient precision for US equities and futures. You can see that the ctable shows that the fields are u4:

ctable((25920,), [('open', '<u4'), ('high', '<u4'), ('low', '<u4'), ('close', '<u4'), ('volume', '<u4')])
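The storage scheme described above can be sketched roughly as follows. This is not zipline's writer code: encode_prices is a hypothetical helper, and it assumes missing values are mapped to 0 before the cast, in line with the reserved-zero convention explained in this comment.

```python
import numpy as np

OHLC_RATIO = 1000  # same value as "ohlc_ratio" in the bundle's metadata.json


def encode_prices(prices, ratio=OHLC_RATIO):
    """Hypothetical helper: scale float prices, round, and store as uint32.

    NaN is mapped to 0 first (an assumption about how missing bars are stored).
    """
    scaled = np.nan_to_num(np.asarray(prices, dtype="float64")) * ratio
    return np.round(scaled).astype("uint32")


encoded = encode_prices([105.578125, 105.625, np.nan, 0.0])
print(encoded.dtype, encoded)
```

Note that a real 0.0 and a missing value end up as the same stored integer, and that anything beyond three decimal places is rounded away.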

Integers and unsigned integers do not have a native missing value, so we have reserved 0 to be the missing value for the data. The reader does this conversion here: https://github.com/quantopian/zipline/blob/master/zipline/data/minute_bars.py#L1141. The assumption is that no asset could have a price of 0, but that might not actually be correct here. Was this just to test the ingestion, or do these prices hit 0?
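For illustration, the read-side conversion could look something like the sketch below. It is not the code at the linked line; decode_prices is a hypothetical name, and the ratio of 1000 is taken from the description above.

```python
import numpy as np


def decode_prices(stored, ratio=1000):
    """Hypothetical helper: turn stored uint32 values back into float prices.

    A stored 0 is the reserved missing value, so it becomes NaN.
    """
    stored = np.asarray(stored, dtype="uint32")
    prices = stored.astype("float64") / ratio
    prices[stored == 0] = np.nan
    return prices


decoded = decode_prices([105625, 0])
print(decoded)
```

Under this scheme a bar written with all-zero prices is indistinguishable from a bar that was never written.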

We should probably add a guard in the writing that says that you cannot set these values to 0. We expect users to provide NaN when it is missing, and we will convert on our own.
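A guard of the kind suggested here might be sketched as below. This is not an existing zipline function: check_no_zero_prices and PRICE_COLUMNS are hypothetical names, and the check could just as well be run on the user side, on each frame before it is yielded to the writer.

```python
import numpy as np
import pandas as pd

PRICE_COLUMNS = ("open", "high", "low", "close")


def check_no_zero_prices(frame, columns=PRICE_COLUMNS):
    """Hypothetical guard: raise if any price column contains an exact 0.

    NaN passes the check, since NaN is the expected marker for missing bars.
    """
    present = [c for c in columns if c in frame.columns]
    zero_counts = (frame[present] == 0).sum()
    bad = zero_counts[zero_counts > 0]
    if not bad.empty:
        raise ValueError(
            f"zero prices would be read back as NaN: {bad.to_dict()}"
        )
    return frame


ok = pd.DataFrame({"open": [105.5, np.nan], "close": [105.6, np.nan]})
check_no_zero_prices(ok)  # passes: NaN is allowed, 0 is not
```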

tl;dr: price of 0 is translated to NaN by the reader
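Given that, one way to adapt an ingest generator is to leave missing price bars as NaN after the reindex instead of calling fillna(0.0) on the whole frame. The sketch below uses a hypothetical helper, align_to_minutes; filling only volume with 0 is an assumption about what is sensible for untraded minutes, not something stated in this thread.

```python
import numpy as np
import pandas as pd


def align_to_minutes(asset_data, minutes):
    """Hypothetical helper: reindex to the target minutes, keeping prices as NaN."""
    aligned = asset_data.reindex(minutes)
    if "volume" in aligned.columns:
        # Assumption: a minute with no bar simply had no traded volume.
        aligned["volume"] = aligned["volume"].fillna(0)
    return aligned


minutes = pd.date_range("2018-08-29 12:00", periods=4, freq="min", tz="UTC")
bars = pd.DataFrame(
    {"close": [105.578125, 105.578125], "volume": [1325, 1325]},
    index=minutes[1:3],  # data exists only for the two middle minutes
)
aligned = align_to_minutes(bars, minutes)
print(aligned)
```

With this shape, a frame that comes back entirely NaN makes a misaligned index obvious, whereas a frame of zeros looks like valid data.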

marketneutral commented 5 years ago

Thank you @llllllllll ... that was indeed the issue!!! 🤕 I was not intentionally writing zeros; I am looking at that now.

marketneutral commented 5 years ago

tz issue...fixed. Works!!! Thank you.

[20:57:44.587906]: INFO: initialize: Future(1 [TUZ2018])
[20:57:44.601528]: INFO: handle_data: 2018-09-04 06:31:00-04:00
[20:57:44.616197]: INFO: handle_data: 105.625
[20:57:44.616372]: INFO: handle_data: 2018-09-04 06:32:00-04:00
[20:57:44.616570]: INFO: handle_data: 105.625
[20:57:44.616724]: INFO: handle_data: 2018-09-04 06:33:00-04:00
[20:57:44.616909]: INFO: handle_data: 105.625
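The thread does not say what the timezone issue was. As a purely illustrative sketch, the check below measures how many of the target minutes actually have a matching timestamp in the source data; a value near zero suggests the reindex found nothing to align with, which a later fillna(0.0) would hide. The helper name coverage and the mis-localization scenario are assumptions made for the example.

```python
import pandas as pd


def coverage(data_index, target_minutes):
    """Fraction of target minutes that have a matching timestamp in the data."""
    return float(target_minutes.isin(data_index).mean())


target = pd.date_range("2018-08-29 12:01", periods=5, freq="min", tz="UTC")
naive = pd.date_range("2018-08-29 12:01", periods=3, freq="min")

# Correct handling: the naive timestamps really are UTC.
as_utc = naive.tz_localize("UTC")
# Hypothetical mistake: the same timestamps treated as US/Eastern wall time.
as_eastern = naive.tz_localize("US/Eastern").tz_convert("UTC")

print(coverage(as_utc, target))
print(coverage(as_eastern, target))
```

Running a check like this inside the ingest function, before yielding each frame, is a cheap way to catch an index that does not line up with the calendar minutes.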