Closed by diehl 9 years ago
xref is #3866
Something is throwing the inference engine off. You can try iterating by chunks and seeing where the invalid data is.
After it's read in, you could do .convert_objects(convert_numeric=True). If this works, there might be a deeper issue.
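For readers on newer pandas: convert_objects was deprecated and later removed, and pd.to_numeric is the usual replacement. The following is only a sketch of the same idea, not what was run in this thread. The column name `value` and the inline CSV text are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: one integer column with a stray bad row.
csv_text = "value\n1\n2\noops\n4\n"

# Read everything as strings so no per-block type inference takes place.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Coerce to numbers; anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(df["value"], errors="coerce")

# Rows that failed to convert are the ones worth inspecting by hand.
bad_rows = df[numeric.isna()]
print(bad_rows)
```

If `bad_rows` comes back empty on the real file, the data itself is probably clean and the problem is more likely in the parser.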
@jreback I just tried the convert_objects call and it works. I've manually inspected the CSV file hunting for hidden characters and have seen nothing.
I found a similar problem on Stack Overflow that suggests this problem has been around for a bit. http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue
All the evidence so far points to a deeper issue.
@diehl the issue you pointed to is really completely different, though if you DID have actual string-likes then I suppose it could be the same (though it IS really hard to inspect visually for this kind of thing). That's why I suggested you iterate in read_csv using chunksize=1000 or something, and narrow down which chunk gives this error. Or, if no chunk does, we can discuss a bug from there.
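The chunked search suggested above could look roughly like this. It is a sketch under assumptions: `find_bad_chunks` is a hypothetical helper, the column name `value` is invented, and pd.to_numeric stands in for whatever check fits the real data.

```python
import io
import pandas as pd

def find_bad_chunks(source, column, chunksize=1000):
    """Return (chunk_number, bad_rows) pairs for chunks holding non-numeric values."""
    hits = []
    # dtype=str keeps every value as text so the check below sees the raw field.
    reader = pd.read_csv(source, dtype=str, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        numeric = pd.to_numeric(chunk[column], errors="coerce")
        bad = chunk[numeric.isna()]
        if not bad.empty:
            hits.append((i, bad))
    return hits

# Tiny made-up input: the fourth data row is not a clean integer.
csv_text = "value\n" + "\n".join(["1", "2", "3", "x4", "5", "6"]) + "\n"
hits = find_bad_chunks(io.StringIO(csv_text), "value", chunksize=2)
for i, bad in hits:
    print(i, bad.index.tolist())
```

On a real file, pass the path instead of the StringIO object. The row labels in each result count data rows from zero across the whole file, so they point straight at the offending lines.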
@jreback false alarm. after more digging I found the issue with the file. thanks for the feedback.
np
I have a large CSV file that contains a single column of integers. When loading the CSV file with read_csv, nearly 2/3rds of the approximately 1.5 million values are loaded as ints while the remaining values are loaded as strings. I see no obvious problem with the file that would lead to this behavior. The file I used is available here: https://drive.google.com/file/d/0ByZvgdTf0yfAT2dPdHRvc2hLVkU/view?usp=sharing
It appears that a contiguous block of rows was read as strings - a block in the middle of the file. Very curious.
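One way to pin down where such a block starts and ends is to flag the string-typed values in the loaded column and group consecutive runs. This is a sketch only. The small Series below is invented to mimic the symptom and is not taken from the linked file.

```python
import pandas as pd

# Hypothetical column mimicking the report: ints, then a block of strings, then ints.
s = pd.Series([1, 2, "3", "4", "5", 6, 7], dtype=object)

# True where the loaded value is a Python string rather than an int.
is_str = s.map(lambda v: isinstance(v, str))

# A new run starts wherever the flag differs from the previous row.
run_id = (is_str != is_str.shift()).cumsum()

# Collect the first and last row label of every run of strings.
blocks = [
    (group.index[0], group.index[-1])
    for _, group in s[is_str].groupby(run_id[is_str])
]
print(blocks)
```

Each tuple gives the first and last row label of one string block, which narrows down which lines of the CSV to inspect.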
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.1.0
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None