pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Type inference problem with read_csv #9669

Closed diehl closed 9 years ago

diehl commented 9 years ago

I have a large CSV file that contains a single column of integers. When loading the CSV file with read_csv, nearly 2/3rds of the approximately 1.5 million values are loaded as ints while the remaining values are loaded as strings. I see no obvious problem with the file that would lead to this behavior. The file I used is available here: https://drive.google.com/file/d/0ByZvgdTf0yfAT2dPdHRvc2hLVkU/view?usp=sharing

import pandas as pd
from collections import Counter

test_df = pd.read_csv('test.csv', header=0)
# Tally the Python type of every value in the column
c = Counter()
for el in test_df['SERIALNO'].values:
    c[type(el)] += 1
print(c)
Counter({<type 'int'>: 952026, <type 'str'>: 524288})

It appears that a contiguous block of rows in the middle of the file was read as strings. Very curious.

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.1.0
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

jreback commented 9 years ago

xref is #3866

Something is throwing the inference engine off. You can try iterating by chunks to see where the invalid data is.

After it's read in, you could do .convert_objects(convert_numeric=True). If this works, there might be a deeper issue.
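For readers on a newer pandas: convert_objects was deprecated and later removed, and pd.to_numeric covers the same conversion. A minimal sketch of that approach follows; the sample values are made up for illustration and are not taken from the reporter's file.

```python
import pandas as pd

# Hypothetical mixed column mimicking the report: ints alongside digit strings
mixed = pd.Series([1, 2, "3", "4"], dtype=object, name="SERIALNO")

# The default errors="raise" fails loudly if any cell is not numeric
converted = pd.to_numeric(mixed)
print(converted.dtype)

# errors="coerce" turns unparseable cells into NaN so they can be located
suspect = pd.Series([1, "2", "x3"], dtype=object)
coerced = pd.to_numeric(suspect, errors="coerce")
print(suspect[coerced.isna()].tolist())
```

If the plain conversion succeeds, every string in the column was numeric after all; if it raises, the coerce variant points at the offending cells.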

diehl commented 9 years ago

@jreback I just tried the convert_objects call and it works. I've manually inspected the CSV file hunting for hidden characters and have seen nothing.

I found a similar problem on Stack Overflow that suggests this problem has been around for a bit. http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue

All the evidence so far points to a deeper issue.

jreback commented 9 years ago

@diehl the issue you pointed to is really completely different, though if you DID have actual string-likes then I suppose it could be the same (though it IS really hard to inspect visually for this kind of thing). That's why I suggested you iterate in read_csv using chunksize=1000 or something, and narrow down which chunk gives this error. If no chunk does, then we can discuss a bug from there.
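The chunked search suggested here could look roughly like the sketch below. The helper name find_bad_chunks and the inline sample data are hypothetical; reading with dtype=str is an assumption made so that each cell is checked exactly as written instead of relying on inference.

```python
import io
import pandas as pd

def find_bad_chunks(source, column, chunksize=1000):
    """Return (first_row, bad_values) for each chunk with non-integer cells."""
    bad = []
    # dtype=str switches off type inference so cells arrive as raw text
    reader = pd.read_csv(source, header=0, dtype=str, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        values = chunk[column]
        # Flag anything that is not an optional sign followed only by digits;
        # missing cells (NaN) are flagged as well because of na=False
        mask = ~values.str.match(r"[+-]?\d+\Z", na=False)
        if mask.any():
            bad.append((i * chunksize, values[mask].tolist()))
    return bad

# Tiny stand-in for test.csv with one deliberately invalid row
sample = io.StringIO("SERIALNO\n1\n2\nabc\n4\n5\n")
print(find_bad_chunks(sample, "SERIALNO", chunksize=2))  # [(2, ['abc'])]
```

Pointing the same function at 'test.csv' with chunksize=1000 would narrow the problem down to the data row offsets where the bad values start.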

diehl commented 9 years ago

@jreback false alarm. after more digging I found the issue with the file. thanks for the feedback.

jreback commented 9 years ago

np