Closed. rsvp closed this issue 7 years ago.
An immediate remedy, if you hit this issue as a fatal error, is to downgrade to pandas 0.18.0 or 0.18.1.
The problem summarized: for pandas API > 0.18, you can either downsample OR upsample, but not both.
The prior API implementations would allow you to pass an aggregator function (e.g. mean) even while upsampling, which caused some confusion.
Thus the fecon235 resampling functions, which have been working in both upsampling and downsampling situations, will break; see the yi_fred code, for example.
So is there a pandas way to detect which type of sampling is being requested given the data argument? Otherwise, the fix may have to involve an additional mandatory flag, and tedious edits across many fecon235 notebooks.
def index_delta_secs( dataframe ):
    '''Find minimum in seconds between index values.'''
    nanosecs_timedelta64 = np.diff(dataframe.index.values).min()
    #  Picked min() over median() to conserve memory;      ^^^^^!
    #  also avoids missing values issue,
    #  e.g. weekend or holidays gaps for daily data.
    secs_timedelta64 = tools.div( nanosecs_timedelta64, 1e9 )
    #  To avoid numerical error, we divide before converting type:
    secs = secs_timedelta64.astype( np.float32 )
    if secs == 0.0:
        system.warn('Index contains duplicate, min delta was 0.')
        return secs
    else:
        return secs

# There are OTHER METHODS to get the FREQUENCY of a dataframe:
# e.g. df.index.freq OR df.index.freqstr ,
# however, these work only if the frequency was attributed:
# e.g. '1 Hour' OR 'H' respectively.
# The fecon235 derived dataframes will usually return None.
#
# Two timedelta64 units, 'Y' years and 'M' months, are
# specially treated because the time they represent depends upon
# their context. While a timedelta64 day unit is equivalent to
# 24 hours, there is difficulty converting a month unit into days
# because months have varying number of days.
# Other numpy timedelta64 units can be found here:
# http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
#
# For pandas we could do: pd.infer_freq( df.index )
# which, for example, might output 'B' for business daily series.
#
# But the STRING representation of index frequency is IMPRACTICAL
# since we may want to compare two unevenly timed indexes.
# That comparison is BEST DONE NUMERICALLY in some common unit
# (we use seconds since that is the Unix epoch convention).
#
# Such comparison will be crucial for the machine
# to choose whether downsampling or upsampling is appropriate.
# The casual user should not be expected to know the functions
# within index_delta_secs() to smoothly work with a notebook.
# For details on frequency conversion, see McKinney 2013,
# Chp. 10 RESAMPLING, esp. Table 10-5 on downsampling.
# pandas defaults are: how='mean', closed='right', label='right'
#
# 2014-08-10 closed and label to the 'left' conform to FRED practices.
# how='median' since it is more robust than 'mean'.
# 2014-08-14 If upsampling, interpolate() does linear evenly,
# disregarding uneven time intervals.
# 2016-11-06 McKinney 2013 on resampling is outdated as of pandas 0.18

def resample_main( dataframe, rule, secs ):
    '''Generalized resample routine for downsampling or upsampling.'''
    #  rule is the offset string or object representing target conversion,
    #       e.g. 'B', 'MS', or 'QS-OCT' to be compatible with FRED.
    #  secs should be the maximum seconds expected for rule frequency.
    if index_delta_secs(dataframe) < secs:
        df = dataframe.resample(rule, closed='left', label='left').median()
        #    how='median' for DOWNSAMPLING deprecated as of pandas 0.18
        return df
    else:
        df = dataframe.resample(rule, closed='left', label='left').fillna(None)
        #    fill_method=None for UPSAMPLING deprecated as of pandas 0.18
        #    note that None almost acts like np.nan which fails as argument.
        #    interpolate() applies to those filled nulls when upsampling:
        #    'linear' ignores index values treating it as equally spaced.
        return df.interpolate(method='linear')

def daily( dataframe ):
    '''Resample data to daily using only business days.'''
    #  'D' is used for calendar daily.
    #  'B' for business daily.
    secs1day2hours = 93600.0
    return resample_main( dataframe, 'B', secs1day2hours )


def monthly( dataframe ):
    '''Resample data to FRED's month start frequency.'''
    #  FRED uses the start of the month to index its monthly data.
    #  'M' is used for end of month.
    #  'MS' for start of month.
    secs31days = 2678400.0
    return resample_main( dataframe, 'MS', secs31days )


def quarterly( dataframe ):
    '''Resample data to FRED's quarterly start frequency.'''
    #  FRED uses the start of the month to index its monthly data.
    #  Then for quarterly data:  1-01, 4-01, 7-01, 10-01.
    #                             Q1    Q2    Q3    Q4
    #            ________ Start at first of months,
    #            ________ for year ending in indicated month.
    #  'QS-OCT'
    secs93days = 8035200.0
    return resample_main( dataframe, 'QS-OCT', secs93days )
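
To illustrate the decision rule that resample_main() relies on, here is a minimal, self-contained sketch. It does not use the fecon235 helpers (tools.div, system.warn); the names min_index_gap_secs and is_downsampling are hypothetical and exist only for this example, and the data is made up. It assumes a DataFrame with a DatetimeIndex.

```python
import numpy as np
import pandas as pd


def min_index_gap_secs(df):
    """Smallest gap between consecutive index timestamps, in seconds."""
    gaps = np.diff(df.index.values)            # array of numpy timedelta64
    return gaps.min() / np.timedelta64(1, 's')


def is_downsampling(df, max_secs_for_rule):
    """True when the data is sampled more finely than the target rule."""
    return min_index_gap_secs(df) < max_secs_for_rule


# Business-daily data: the weekend gaps do not affect the minimum gap.
daily_df = pd.DataFrame({'x': np.arange(10.0)},
                        index=pd.bdate_range('2016-01-04', periods=10))
# Quarterly data indexed at quarter starts, as FRED does.
quarterly_df = pd.DataFrame({'x': np.arange(4.0)},
                            index=pd.date_range('2016-01-01', periods=4,
                                                freq='QS-OCT'))

SECS_31_DAYS = 2678400.0   # same threshold that monthly() passes for 'MS'

print(min_index_gap_secs(daily_df))                  # 86400.0
print(is_downsampling(daily_df, SECS_31_DAYS))       # True:  daily -> monthly
print(is_downsampling(quarterly_df, SECS_31_DAYS))   # False: quarterly -> monthly
```

Comparing numeric gaps rather than frequency strings means the caller never has to say which direction the conversion goes, which is the point of routing daily(), monthly(), and quarterly() through one routine.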
The fecon235 source code was refactored in https://git.io/fecon236
Here's the specific module which fixes the issue: https://github.com/MathSci/fecon236/blob/master/fecon236/host/fred.py
Description of specific issue
When resampling a time-series the following warning(s) will appear:
It is somewhat cryptic until one realizes that how='median' was being passed as an argument to the .resample function. So the how argument becomes the problem for the yi_fred module, specifically for our functions daily(), monthly(), and quarterly() in fecon235.
(Sidenote: how='median' since it is more robust than 'mean'.)
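
As a rough sketch of the change for the downsampling case, assuming pandas 0.18 or later: the aggregation moves out of the how argument and becomes a method call on the object that .resample() returns. The data below is made up for illustration.

```python
import pandas as pd

# Ten business days of illustrative values, all within January 2016.
idx = pd.bdate_range('2016-01-04', periods=10)
df = pd.DataFrame({'x': [1.0, 2.0, 9.0, 4.0, 5.0, 6.0, 7.0, 8.0, 3.0, 10.0]},
                  index=idx)

# Old style, deprecated as of pandas 0.18:
#     df.resample('MS', how='median', closed='left', label='left')
#
# Method style: resample() returns a deferred Resampler object,
# and the aggregation is chosen by calling a method on it.
monthly = df.resample('MS', closed='left', label='left').median()

print(monthly)   # a single row labeled 2016-01-01 holding the median, 5.5
```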
The second cryptic warning can be traced to our use of fill_method=None when upsampling. The new API urges us to instead use methods:
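
The text of that warning is not reproduced here. As a hedged illustration only, the sketch below shows a few Resampler methods that, to the best of our knowledge, cover the upsampling case in pandas 0.18 and later. The data is made up, and .asfreq() is used as a stand-in for the old fill_method=None behavior; that equivalence is an assumption of this example.

```python
import pandas as pd

# Three month-start observations (illustrative values).
idx = pd.date_range('2016-01-01', periods=3, freq='MS')
df = pd.DataFrame({'x': [1.0, 2.0, 4.0]}, index=idx)

# Upsample to calendar-daily. asfreq() inserts the new rows as NaN,
# leaving the choice of fill entirely to a later step.
sparse = df.resample('D').asfreq()

# Carry the last observation forward.
held = df.resample('D').ffill()

# Fill the inserted NaN rows linearly; 'linear' treats rows as equally
# spaced and ignores the actual index values.
linear = df.resample('D').asfreq().interpolate(method='linear')

print(sparse['x'].isna().sum())   # count of rows created by upsampling
print(linear.head())
```

Because the index here is evenly spaced after upsampling to 'D', linear interpolation by row position and by time give the same result; with 'B' the two would differ around weekends.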
Expected behavior
No such warnings, and no possibly fatal termination.
Observed behavior
Warnings started as of pandas 0.18
Why would the improvement be useful to most users?
Because daily(), weekly(), and monthly() in fecon235 should just work without the casual user needing to learn obscure flags and methods (subject to future API changes).
Additional helpful details for bugs
[x] Problem started recently, but not in older versions
[ ] Problem happens with all files, not only some files
[x] Problem can be reliably reproduced
[ ] Problem happens randomly
fecon235 version: v4.16.1030
pandas version: 0.18
Python version: both 2.7 and 3
Operating system: cross-platform