rsvp / fecon235

Notebooks for financial economics. Keywords: Jupyter notebook pandas Federal Reserve FRED Ferbus GDP CPI PCE inflation unemployment wage income debt Case-Shiller housing asset portfolio equities SPX bonds TIPS rates currency FX euro EUR USD JPY yen XAU gold Brent WTI oil Holt-Winters time-series forecasting statistics econometrics
https://git.io/econ

pandas .resample() "how" deprecation as of its 0.19 version. Fix our daily(), monthly(), quarterly() #6

Closed rsvp closed 7 years ago

rsvp commented 7 years ago

Description of specific issue

When resampling a time-series the following warning(s) will appear:

FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).median() fill_method=None)

FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

The first warning is somewhat cryptic until one realizes that how='median' was being passed as an argument to .resample(). The how argument is therefore the problem for the yi_fred module, specifically for our functions daily(), monthly(), and quarterly() in fecon235.

(Sidenote: how='median' since it is more robust than 'mean'.)

The second cryptic warning can be traced to our use of fill_method=None when upsampling. The new API urges us to instead use methods:
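As a hedged illustration of the method-chaining style that both warnings point to, the sketch below contrasts the old keyword arguments with their replacements. The series and its column name Y are made up for the example. asfreq() is only one of several upsampling methods (ffill() and bfill() are others); it is assumed here because it leaves the new rows empty, much as fill_method=None did.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series, 2016-01-01 through 2016-03-30.
idx = pd.date_range('2016-01-01', periods=90, freq='D')
df = pd.DataFrame({'Y': np.arange(90.0)}, index=idx)

# Old style (deprecated):  df.resample('MS', how='median')
# New style: resample() is deferred, the aggregation is a chained method.
monthly = df.resample('MS').median()

# Old style (deprecated):  monthly.resample('D', fill_method=None)
# New style: chain an upsampling method; asfreq() leaves new rows as NaN.
upsampled = monthly.resample('D').asfreq()

# The NaN rows can then be filled, for example by linear interpolation.
filled = upsampled.interpolate(method='linear')
```

Here monthly has one row per month start, while upsampled spans every calendar day between the first and last month start.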


Expected behavior

No such warnings, nor the possibly fatal termination that may follow them.

Observed behavior

Warnings started as of pandas 0.18

Why would the improvement be useful to most users?

Because daily(), monthly(), and quarterly() in fecon235 should just work without the casual user needing to learn obscure flags and methods (subject to future API changes).

Additional helpful details for bugs

rsvp commented 7 years ago

An immediate remedy is to downgrade to pandas 0.18.0 or 0.18.1 if you fatally encounter this issue.

The problem summarized: for pandas API > 0.18, you can either downsample OR upsample, but not both.

The prior API implementations would allow you to pass an aggregator function (e.g. mean) even though you were upsampling, which caused some confusion.

Thus the fecon235 resampling functions, which have been working in both upsampling and downsampling situations, will break; see, for example, the yi_fred code.

So is there a pandas way to detect which type of sampling is being requested given the data argument? Otherwise, the fix may have to involve an additional mandatory flag, and tedious edits across many fecon235 notebooks.
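One possible answer, sketched here under stated assumptions, is to measure the smallest gap in the index numerically and compare it with the span of the target frequency. The helper name min_delta_secs and the sample index are hypothetical. The example also shows why the frequency attribute and pd.infer_freq() are of little help once an index has uneven gaps.

```python
import numpy as np
import pandas as pd

def min_delta_secs(index):
    """Smallest gap between consecutive index values, in seconds."""
    deltas = np.diff(index.values)            # numpy timedelta64 array
    return float(deltas.min() / np.timedelta64(1, 's'))

# Business-daily index with one day removed, as a holiday would do.
idx = pd.bdate_range('2016-01-04', periods=10).delete(3)

freq_attr = idx.freq              # None: the attribute is lost after deletion
inferred = pd.infer_freq(idx)     # None: the gaps are uneven
secs = min_delta_secs(idx)        # one day, expressed in seconds

# Compare against the span of the target frequency, here 31 days.
secs_31_days = 31 * 24 * 3600.0
direction = 'downsample' if secs < secs_31_days else 'upsample'
```

Comparing seconds rather than frequency strings means no extra flag has to be passed by the notebook user.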

rsvp commented 7 years ago

Key points in resolving this issue

pandas breaks previous API for resampling

Code which solves current issue

#  Assumes numpy is imported as np; tools and system are fecon235 helper modules.
def index_delta_secs( dataframe ):
    '''Find minimum in seconds between index values.'''
    nanosecs_timedelta64 = np.diff(dataframe.index.values).min()
    #  Picked min() over median() to conserve memory;      ^^^^^!
    #  also avoids missing values issue, 
    #  e.g. weekend or holidays gaps for daily data.
    secs_timedelta64 = tools.div( nanosecs_timedelta64, 1e9 )
    #  To avoid numerical error, we divide before converting type: 
    secs = secs_timedelta64.astype( np.float32 )
    if secs == 0.0:
        system.warn('Index contains duplicate, min delta was 0.')
    return secs

    #  There are OTHER METHODS to get the FREQUENCY of a dataframe:
    #       e.g.  df.index.freq  OR  df.index.freqstr , 
    #  however, these work only if the frequency was attributed:
    #       e.g.  '1 Hour'       OR  'H'  respectively. 
    #  The fecon235 derived dataframes will usually return None.
    #  
    #  Two timedelta64 units, 'Y' years and 'M' months, are 
    #  specially treated because the time they represent depends upon
    #  their context. While a timedelta64 day unit is equivalent to 
    #  24 hours, there is difficulty converting a month unit into days 
    #  because months have varying number of days. 
    #       Other numpy timedelta64 units can be found here: 
    #  http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
    #  
    #  For pandas we could do:  pd.infer_freq( df.index )
    #  which, for example, might output 'B' for business daily series.
    #  
    #  But the STRING representation of index frequency is IMPRACTICAL
    #  since we may want to compare two unevenly timed indexes. 
    #  That comparison is BEST DONE NUMERICALLY in some common unit 
    #  (we use seconds since that is the Unix epoch convention).
    #
    #  Such comparison will be crucial for the machine 
    #  to choose whether downsampling or upsampling is appropriate.
    #  The casual user should not be expected to know the functions
    #  within index_delta_secs() to smoothly work with a notebook.

#  For details on frequency conversion, see McKinney 2013, 
#       Chp. 10 RESAMPLING, esp. Table 10-5 on downsampling.
#       pandas defaults are:  how='mean', closed='right', label='right'
#
#  2014-08-10  closed and label to the 'left' conform to FRED practices.
#              how='median' since it is more robust than 'mean'. 
#  2014-08-14  If upsampling, interpolate() does linear evenly, 
#              disregarding uneven time intervals.
#  2016-11-06  McKinney 2013 on resampling is outdated as of pandas 0.18

def resample_main( dataframe, rule, secs ):
    '''Generalized resample routine for downsampling or upsampling.'''
    #  rule is the offset string or object representing target conversion,
    #       e.g. 'B', 'MS', or 'QS-OCT' to be compatible with FRED.
    #  secs should be the maximum seconds expected for rule frequency.
    if index_delta_secs(dataframe) < secs:
        df = dataframe.resample(rule, closed='left', label='left').median()
        #    how='median' for DOWNSAMPLING deprecated as of pandas 0.18
        return df
    else:
        df = dataframe.resample(rule, closed='left', label='left').fillna(None)
        #    fill_method=None for UPSAMPLING deprecated as of pandas 0.18
        #    note that None almost acts like np.nan which fails as argument.
        #    interpolate() applies to those filled nulls when upsampling:
        #    'linear' ignores index values treating it as equally spaced.
        return df.interpolate(method='linear')

def daily( dataframe ):
    '''Resample data to daily using only business days.'''
    #                         'D' is used for calendar daily
    #                         'B' for business daily
    secs1day2hours = 93600.0
    return resample_main( dataframe, 'B', secs1day2hours )

def monthly( dataframe ):
    '''Resample data to FRED's month start frequency.'''
    #  FRED uses the start of the month to index its monthly data.
    #                         'M'  is used for end of month.
    #                         'MS' for start of month.
    secs31days = 2678400.0
    return resample_main( dataframe, 'MS', secs31days )

def quarterly( dataframe ):
    '''Resample data to FRED's quarterly start frequency.'''
    #  FRED uses the start of the month to index its monthly data.
    #  Then for quarterly data: 1-01, 4-01, 7-01, 10-01.
    #                            Q1    Q2    Q3     Q4
    #  ________ Start at first of months,
    #  ________ for year ending in indicated month.
    #  'QS-OCT'
    secs93days = 8035200.0
    return resample_main( dataframe, 'QS-OCT', secs93days )

rsvp commented 6 years ago

2018 Addendum

The fecon235 source code was refactored in https://git.io/fecon236

Here's the specific module which fixes the issue: https://github.com/MathSci/fecon236/blob/master/fecon236/host/fred.py
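
For readers who want to try the resample_main() logic posted earlier in this thread without the fecon235 helpers, the following is a minimal self-contained sketch. It is not taken from fecon236. It swaps .fillna(None) for .asfreq(), on the assumption that a recent pandas no longer accepts the former, and it replaces tools.div and system.warn with plain numpy arithmetic. The sample dataframes are made up.

```python
import numpy as np
import pandas as pd

def index_delta_secs(dataframe):
    """Find minimum in seconds between index values."""
    deltas = np.diff(dataframe.index.values)
    return float(deltas.min() / np.timedelta64(1, 's'))

def resample_main(dataframe, rule, secs):
    """Downsample by median, or upsample by linear interpolation."""
    resampler = dataframe.resample(rule, closed='left', label='left')
    if index_delta_secs(dataframe) < secs:
        # Index is finer than the target: aggregate each bin.
        return resampler.median()
    # Index is coarser than the target: create empty rows, then fill.
    # asfreq() is used in place of the thread's fillna(None).
    return resampler.asfreq().interpolate(method='linear')

def monthly(dataframe):
    """Resample data to month start frequency."""
    secs31days = 31 * 24 * 3600.0
    return resample_main(dataframe, 'MS', secs31days)

# Hypothetical daily data: downsampled to monthly medians.
daily_df = pd.DataFrame(
    {'Y': np.arange(60.0)},
    index=pd.date_range('2016-01-01', periods=60, freq='D'))
m_down = monthly(daily_df)

# Hypothetical quarterly data: upsampled to monthly by interpolation.
quarterly_df = pd.DataFrame(
    {'Y': [1.0, 4.0, 7.0]},
    index=pd.to_datetime(['2016-01-01', '2016-04-01', '2016-07-01']))
m_up = monthly(quarterly_df)
```

The same monthly() call handles both inputs because the direction is chosen from the index spacing rather than from a user-supplied flag.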