pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.65k stars 17.92k forks source link

read_fwf(), read_table() are mangling columns #3594

Closed danmd closed 11 years ago

danmd commented 11 years ago

version 0.11.0 has introduced a confusing behavior when importing data via read_fwf (and, I'm pretty sure, read_table)

import pandas as pd
from cStringIO import StringIO
from datetime import datetime

tzlist = [1,10,20,30,60,80,100]
ntz = len(tzlist)
tcolspecs = [16]+[8]*ntz
tcolnames = ['SST'] + ["T%03d" % z for z in tzlist[1:]]
data = '''  2009164202000   9.5403  9.4105  8.6571  7.8372  6.0612  5.8843  5.5192
  2009164203000   9.5435  9.2010  8.6167  7.8176  6.0804  5.8728  5.4869
  2009164204000   9.5873  9.1326  8.4694  7.5889  6.0422  5.8526  5.4657
  2009164205000   9.5810  9.0896  8.4009  7.4652  6.0322  5.8189  5.4379
  2009164210000   9.6034  9.0897  8.3822  7.4905  6.0908  5.7904  5.4039'''
dftemp = pd.read_fwf(StringIO(data),
                     index_col=0,
                     header=None,
                     names=tcolnames,
                     widths=tcolspecs,
                     parse_dates=True,
                     date_parser=lambda s: datetime.strptime(s,'%Y%j%H%M%S'))

With version 0.10.1

In [1]: pd.__version__
Out[1]: '0.10.1'

In [2]: dftemp
Out[2]: 
                        SST    T010    T020    T030    T060    T080    T100
2009-06-13 20:20:00  9.5403  9.4105  8.6571  7.8372  6.0612  5.8843  5.5192
2009-06-13 20:30:00  9.5435  9.2010  8.6167  7.8176  6.0804  5.8728  5.4869
2009-06-13 20:40:00  9.5873  9.1326  8.4694  7.5889  6.0422  5.8526  5.4657
2009-06-13 20:50:00  9.5810  9.0896  8.4009  7.4652  6.0322  5.8189  5.4379
2009-06-13 21:00:00  9.6034  9.0897  8.3822  7.4905  6.0908  5.7904  5.4039

In [3]: dftemp.T030
Out[3]: 
2009-06-13 20:20:00    7.8372
2009-06-13 20:30:00    7.8176
2009-06-13 20:40:00    7.5889
2009-06-13 20:50:00    7.4652
2009-06-13 21:00:00    7.4905
Name: T030, dtype: float64

In [4]: dftemp.T060
Out[4]: 
2009-06-13 20:20:00    6.0612
2009-06-13 20:30:00    6.0804
2009-06-13 20:40:00    6.0422
2009-06-13 20:50:00    6.0322
2009-06-13 21:00:00    6.0908
Name: T060

In [5]: dftemp.T080
Out[5]: 
2009-06-13 20:20:00    5.8843
2009-06-13 20:30:00    5.8728
2009-06-13 20:40:00    5.8526
2009-06-13 20:50:00    5.8189
2009-06-13 21:00:00    5.7904
Name: T080

In [6]: dftemp.T100
Out[6]: 
2009-06-13 20:20:00    5.5192
2009-06-13 20:30:00    5.4869
2009-06-13 20:40:00    5.4657
2009-06-13 20:50:00    5.4379
2009-06-13 21:00:00    5.4039
Name: T100, dtype: float64

and, with version 0.11.0

In [1]: pd.__version__
Out[1]: '0.11.0'

In [2]: dftemp
Out[2]: 
                        SST    T010    T020    T030    T060    T080    T100
2009-06-13 20:20:00  9.5403  9.4105  8.6571  7.8372  6.0612  5.8843  5.5192
2009-06-13 20:30:00  9.5435  9.2010  8.6167  7.8176  6.0804  5.8728  5.4869
2009-06-13 20:40:00  9.5873  9.1326  8.4694  7.5889  6.0422  5.8526  5.4657
2009-06-13 20:50:00  9.5810  9.0896  8.4009  7.4652  6.0322  5.8189  5.4379
2009-06-13 21:00:00  9.6034  9.0897  8.3822  7.4905  6.0908  5.7904  5.4039

In [3]: dftemp.T030
Out[3]: 
2009-06-13 20:20:00    7.8372
2009-06-13 20:30:00    7.8176
2009-06-13 20:40:00    7.5889
2009-06-13 20:50:00    7.4652
2009-06-13 21:00:00    7.4905
Name: T030, dtype: float64

In [4]: dftemp.T060
Out[4]: 
Empty DataFrame
Columns: [SST, T010, T020, T030, T060, T080, T100]
Index: []

In [5]: dftemp.T080
Out[5]: 
Empty DataFrame
Columns: [SST, T010, T020, T030, T060, T080, T100]
Index: []

In [6]: dftemp.T100
Out[6]: 
2009-06-13 20:20:00    5.5192
2009-06-13 20:30:00    5.4869
2009-06-13 20:40:00    5.4657
2009-06-13 20:50:00    5.4379
2009-06-13 21:00:00    5.4039
Name: T100, dtype: float64

No matter how many columns I've been importing (up to 32 in some cases), it seems like it is always the 5th and 6th columns getting hit.

jreback commented 11 years ago

This turned out to be a really esoteric case! In 0.11, we enabled time-series slicing via string shortcuts

The following is a normal selection via string based time-series slicingthe

In [7]: df = DataFrame(randn(10,2),index=date_range('20600101',periods=10,freq='Q'))

In [8]: df
Out[8]: 
                   0         1
2060-03-31  0.805302  1.483985
2060-06-30 -0.606980 -0.722344
2060-09-30 -0.715166 -0.065888
2060-12-31  0.086402  2.754483
2061-03-31 -0.578437 -0.262175
2061-06-30  1.016819 -1.214755
2061-09-30  0.691813  0.954097
2061-12-31  0.757279  1.573809
2062-03-31 -0.666067 -1.630134
2062-06-30  0.071744 -0.988124

In [9]: df['2060']
Out[9]: 
                   0         1
2060-03-31  0.805302  1.483985
2060-06-30 -0.606980 -0.722344
2060-09-30 -0.715166 -0.065888
2060-12-31  0.086402  2.754483

However, for some really weird reason, 'T060' (and only that) parses out to a equivalent to '2060' must be a shortcut some how

In [10]: df['T060']
Out[10]: 
                   0         1
2060-03-31  0.805302  1.483985
2060-06-30 -0.606980 -0.722344
2060-09-30 -0.715166 -0.065888
2060-12-31  0.086402  2.754483

I fixed this another way, in any event

thanks for the report

@wesm take a look at this!

jreback commented 11 years ago

closed by #3597