pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Intermittent issue with rolling_min function - Calculation "blows up" #12073

Closed DiSchi123 closed 8 years ago

DiSchi123 commented 8 years ago

I noticed strange behavior that occurs intermittently, and only on my laptop. It is not reproducible on my Windows 8 desktop, which concerns me.

Run the following:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

# input data
windows = np.array([5, 10, 15, 60, 120, 300, 0, 0, 0, 0])  # parameters (window sizes)
dates = pd.date_range('1/1/2000 15:00:00', periods=100000, freq='1H')
hitemps = -50 + np.round(np.random.rand(100000), 3) * 100  # simulated hourly high temperature readings
lowtemps = hitemps - 2 * np.round(np.random.rand(100000), 3)  # low temps
d = {'HiTemp': hitemps, 'LoTemp': lowtemps}

Run the code below repeatedly, about 10 to 20 times. In roughly 1 out of 10 runs the new columns get populated with weird data. With my actual data set (too long to attach) it happens more often, about 1 in 4 times. A sample of the problem output and of the correct output is attached as an xlsx file (further below):

df=DataFrame(data=d, index=dates)

# create new df columns of forward looking moves down, with varying time frames (window sizes). use window size in column name
for i, windowsize in enumerate(windows):   
    if windowsize==0:
        continue
    else:
        colname_df=str('Fw'+str(windowsize)+'Dwn')
        df[colname_df]=df.HiTemp-(pd.rolling_min(df.LoTemp,  window=windowsize).shift(-windowsize))  
df

rolling_min bug report outputs.xlsx
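
A minimal sketch of how the "run it repeatedly" step could be automated. It assumes the bad values come from the pandas arithmetic itself (the subtraction), which is only a guess here, and compares each run against a plain NumPy reference. The function name count_mismatches and its parameters are made up for illustration.

```python
import numpy as np
import pandas as pd


def count_mismatches(n_runs=20, n_rows=100000, seed=0):
    """Repeat a pandas subtraction and count runs that differ from NumPy."""
    rng = np.random.RandomState(seed)
    hi = pd.Series(-50 + np.round(rng.rand(n_rows), 3) * 100)
    lo = pd.Series(hi.values - 2 * np.round(rng.rand(n_rows), 3))

    # Reference result computed on the raw arrays, bypassing pandas.
    expected = hi.values - lo.values

    bad_runs = 0
    for _ in range(n_runs):
        got = (hi - lo).values  # pandas code path
        if not np.array_equal(got, expected):
            bad_runs += 1
    return bad_runs


print('runs with wrong values:', count_mismatches())
```

On a healthy install the count should stay at 0. Any non-zero count would point at the arithmetic step rather than at the rolling minimum.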

This was captured on a 2014 MacBook Air running Windows 10 via Boot Camp (as noted above, the problem does not happen on my HP Windows desktop). Output of pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 19.1.1
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: None

jreback commented 8 years ago

this might be related to using numexpr=2.4.4, try upgrading to 2.4.6 and see if you can repro

see #12023
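
A quick way to check which numexpr is installed, as a minimal sketch. The helper names parse_version and is_suspect are made up for illustration, and the suspect version 2.4.4 is taken from this thread, not from numexpr's release notes.

```python
def parse_version(text):
    """Turn '2.4.4' into (2, 4, 4); non-numeric suffixes are dropped."""
    parts = []
    for piece in text.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


SUSPECT = (2, 4, 4)  # version named in this thread (assumption, not verified)


def is_suspect(version_text):
    """True if the version string matches the suspected release."""
    return parse_version(version_text)[:3] == SUSPECT


try:
    import numexpr
    installed = numexpr.__version__
except ImportError:
    installed = None

if installed is None:
    print('numexpr is not installed')
elif is_suspect(installed):
    print('numexpr %s: suspected release, consider upgrading' % installed)
else:
    print('numexpr %s: not the suspected release' % installed)
```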

jreback commented 8 years ago

btw, what you are doing in that loop is highly inefficient.

much better to:

l = []
for windowsize in windows:
    if windowsize == 0:  # skip the unused slots
        continue
    colname_df = 'Fw' + str(windowsize) + 'Dwn'
    s = df.HiTemp - pd.rolling_min(df.LoTemp, window=windowsize).shift(-windowsize)
    s.name = colname_df
    l.append(s)  # collect each named Series
df = pd.concat(l, axis=1)  # build all new columns in one step

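
For readers on a newer pandas, where pd.rolling_min has since been replaced by the Series.rolling(...).min() method, the same collect-then-concat idea can be sketched as below. The tiny hand-written data set and the dict-based naming are assumptions made for illustration and are not part of the original report.

```python
import pandas as pd

windows = [2, 3, 0]  # 0 marks an unused slot, as in the report

# Small illustrative data set (not the reporter's data).
df = pd.DataFrame({
    'HiTemp': [10.0, 12.0, 11.0, 15.0, 14.0, 13.0],
    'LoTemp': [8.0, 9.0, 7.0, 12.0, 11.0, 10.0],
})

columns = {}
for w in windows:
    if w == 0:
        continue
    # Minimum of LoTemp over the next w rows: trailing rolling min, shifted back by w.
    fwd_min = df['LoTemp'].rolling(window=w).min().shift(-w)
    columns['Fw%dDwn' % w] = df['HiTemp'] - fwd_min

# Dict keys become the column names, so no separate .name assignment is needed.
result = pd.concat(columns, axis=1)
print(df.join(result))
```

The last w rows of each new column are NaN because there are not enough future rows to fill the window.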
DiSchi123 commented 8 years ago

Thanks! I suspected my loop was not ideal, although I had improved it quite a bit to get to this point. I couldn't find anything on custom-named columns.

Will get back once I know more re version..

DiSchi123 commented 8 years ago

Right on the money! I upgraded numexpr to 2.4.6 and the issue is gone. My Windows PC is at numexpr 2.3.1, so I suppose the problem was introduced somewhere in between.

DiSchi123 commented 8 years ago

Problem resolved - issue was numexpr 2.4.4. Upgraded to 2.4.6 and problem gone.

jreback commented 8 years ago

gr8! thanks.

yeh have seen that a few times. I think it's a bug (only) on windows with numexpr 2.4.4