pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Resample upsampling return NaNs #9528

Open KevinLourd opened 9 years ago

KevinLourd commented 9 years ago

Pandas resample misbehaves when upsampling a time series into equal-sized splits:

For instance, I have a time series of length 10:

import numpy as np
import pandas as pd
rng = pd.date_range('20130101', periods=10, freq='T')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

print(ts)

2013-01-01 00:00:00   -1.811999
2013-01-01 00:01:00   -0.890837
2013-01-01 00:02:00   -0.363520
2013-01-01 00:03:00   -0.026245
2013-01-01 00:04:00    1.515072
2013-01-01 00:05:00    0.920129
2013-01-01 00:06:00   -0.125954
2013-01-01 00:07:00    0.588933
2013-01-01 00:08:00   -1.278408
2013-01-01 00:09:00   -0.172525
Freq: T, dtype: float64

When I try to resample it into N > 10 equal parts, almost every row comes back as NaN:

from datetime import timedelta
length = 11
timeSpan = (ts.index[-1]-ts.index[0]+timedelta(minutes=1))
rule = int(timeSpan.total_seconds()/length)
tsNew=ts.resample(str(rule)+"S").mean()

print(tsNew)

2013-01-01 00:00:00    1.845181
2013-01-01 00:00:54         NaN
2013-01-01 00:01:48         NaN
2013-01-01 00:02:42         NaN
2013-01-01 00:03:36         NaN
2013-01-01 00:04:30         NaN
2013-01-01 00:05:24         NaN
2013-01-01 00:06:18         NaN
2013-01-01 00:07:12         NaN
2013-01-01 00:08:06         NaN
2013-01-01 00:09:00   -0.997419
Freq: 54S, dtype: float64

Note: here are my versions, from pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Darwin
OS-release: 14.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.1
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: 0.6.3.None
psycopg2: None

Thank you for your help

jreback commented 9 years ago

I don't think this is a bug per se, rather a convention / api issue.

IIRC (and I'll have to look further), it is actually reindexing here. (That's why the stamps that match your original index have values, but the others don't.)

Doesn't seem very useful though.

In [1]: rng = pd.date_range('20130101',periods=10,freq='T')

In [2]: ts=pd.Series(np.arange(len(rng)), index=rng)

In [8]: ts.resample('54s',how='mean')
Out[8]: 
2013-01-01 00:00:00     0
2013-01-01 00:00:54     1
2013-01-01 00:01:48     2
2013-01-01 00:02:42     3
2013-01-01 00:03:36     4
2013-01-01 00:04:30     5
2013-01-01 00:05:24     6
2013-01-01 00:06:18     7
2013-01-01 00:07:12     8
2013-01-01 00:08:06   NaN
2013-01-01 00:09:00     9
Freq: 54S, dtype: float64

In [9]: ts.resample('54s')
Out[9]: 
2013-01-01 00:00:00     0
2013-01-01 00:00:54   NaN
2013-01-01 00:01:48   NaN
2013-01-01 00:02:42   NaN
2013-01-01 00:03:36   NaN
2013-01-01 00:04:30   NaN
2013-01-01 00:05:24   NaN
2013-01-01 00:06:18   NaN
2013-01-01 00:07:12   NaN
2013-01-01 00:08:06   NaN
2013-01-01 00:09:00     9
Freq: 54S, dtype: float64
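
For readers on a newer pandas: later releases dropped the how= and fill_method= arguments in favour of method calls on the object returned by resample. The following is only a sketch of how the two calls above might be written that way, assuming a recent pandas version; the variable names (as_is, binned_mean, padded) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same 10-point, 1-minute series as in the session above
# ("min" is the minute alias; newer pandas prefers it over "T").
rng = pd.date_range("2013-01-01", periods=10, freq="min")
ts = pd.Series(np.arange(len(rng)), index=rng)

resampler = ts.resample("54s")

# Plain reindex onto the 54-second grid: only stamps that coincide with
# the original index (00:00:00 and 00:09:00) keep a value.
as_is = resampler.asfreq()

# Aggregate per bin: a bin that contains no observation yields NaN.
binned_mean = resampler.mean()

# Forward-fill the upsampled grid (roughly the old fill_method='pad').
padded = resampler.ffill()

print(as_is, binned_mean, padded, sep="\n\n")
```

The point of separating the three calls is that "reindex", "aggregate" and "fill" are different operations, which the old single-call form made easy to confuse.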
jreback commented 9 years ago

What would your expectation be for the result using the input of np.arange(len(ts))?

KevinLourd commented 9 years ago

I would expect the Out[8] that you printed (thank you for the how="mean" tip). However, that does not work in general, as explained below:

Taking for instance a smaller input set:

rng = pd.date_range('20130101',periods=3,freq='T')
ts=pd.Series(np.arange(len(rng)), index=rng)
print(ts)
2013-01-01 00:00:00    0
2013-01-01 00:01:00    1
2013-01-01 00:02:00    2
Freq: T, dtype: int64

When I try to divide it into 5 parts, I get only 4:

from datetime import timedelta
length = 5
timeSpan = (ts.index[-1]-ts.index[0]+timedelta(minutes=1))
rule = int(timeSpan.total_seconds()/length)
tsNew=ts.resample(str(rule)+"S").mean()
print(tsNew)
2013-01-01 00:00:00     0
2013-01-01 00:00:36     1
2013-01-01 00:01:12   NaN
2013-01-01 00:01:48     2
Freq: 36S, dtype: float64

I would expect an extra row holding either 2 or NaN, like this:

2013-01-01 00:02:24     NaN

jreback's example is a special case: the last timestamp, 00:09:00, falls exactly on a bin edge, which is why the expected number of rows appears there.
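
One way to read the row counts in both examples is plain integer arithmetic. The sketch below is not pandas code; n_left_closed_bins is a hypothetical helper, and it assumes the first timestamp sits on a bin edge and that bins are closed on the left, so the grid only has to reach the last observation rather than the full time span plus one minute.

```python
def n_left_closed_bins(last_offset_s: int, rule_s: int) -> int:
    """Count left-closed bins of width rule_s (seconds) needed to cover
    observations from offset 0 up to and including last_offset_s."""
    return last_offset_s // rule_s + 1


# 3 points, 1 minute apart: last observation at 120 s, rule = 180 / 5 = 36 s
small = n_left_closed_bins(120, 36)

# 10 points: last observation at 540 s, rule = int(600 / 11) = 54 s
large = n_left_closed_bins(540, 54)

print(small, large)  # 4 11
```

In the 54-second case the last observation starts a bin of its own (540 is a multiple of 54), which gives 11 rows; in the 36-second case the last observation falls inside the fourth bin, so no fifth row is created.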

jreback commented 9 years ago

So the fill_method argument controls the filling when upsampling (which is odd, because it's not consistent with other methods).

That said, there are a LOT of options for resample.

In [17]: ts.resample('36s',fill_method='pad',closed='right')
Out[17]: 
2013-01-01 00:00:00    0
2013-01-01 00:00:36    0
2013-01-01 00:01:12    1
2013-01-01 00:01:48    1
2013-01-01 00:02:24    2
Freq: 36S, dtype: int64
jreback commented 9 years ago

Just remembered: the first example requires upsampling, so fill_method applies.

In [21]: ts.resample('54s',fill_method='pad')
Out[21]: 
2013-01-01 00:00:00    0
2013-01-01 00:00:54    0
2013-01-01 00:01:48    1
2013-01-01 00:02:42    2
2013-01-01 00:03:36    3
2013-01-01 00:04:30    4
2013-01-01 00:05:24    5
2013-01-01 00:06:18    6
2013-01-01 00:07:12    7
2013-01-01 00:08:06    8
2013-01-01 00:09:00    9
Freq: 54S, dtype: int64
KevinLourd commented 9 years ago

ts.resample('36s',fill_method='pad',closed='right') works fine. However, there is no good reason to be obliged to pass closed='right', since what is expected here is closed='left' behaviour...
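
To compare the two settings side by side, here is a minimal sketch for a newer pandas, where fill_method='pad' is written as .ffill(). It assumes the binning still works as in the outputs earlier in this thread; in that case the default should stop at 00:01:48 and closed='right' should add the 00:02:24 row. The names left and right are just for illustration.

```python
import numpy as np
import pandas as pd

# Same 3-point, 1-minute series as in the smaller example above.
rng = pd.date_range("2013-01-01", periods=3, freq="min")
ts = pd.Series(np.arange(len(rng)), index=rng)

# Default binning (closed='left'), forward-filled.
left = ts.resample("36s").ffill()

# Same rule with right-closed bins, forward-filled.
right = ts.resample("36s", closed="right").ffill()

print(left)
print(right)
```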