pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

csv parsing is too restrictive #9294

Closed kousu closed 9 years ago

kousu commented 9 years ago

On this csv file:

DecisionM,IntelligentM,freq,total
0, 5, 9, 20 
0, 6, 21,33
0, 7, 35,65
0, 8, 35,83
0, 9, 14,41
0, 10, 10,26
1, 5, 11,20
1, 6, 12,33
1, 7, 30,65
1, 8, 48,83
1, 9, 27, 41,,
1, 10, 16, 26

pandas.read_csv() gives me

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-9-28b9c031f12d> in <module>()
----> 1 pandas.read_csv("speeddating.csv")

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    463                     skip_blank_lines=skip_blank_lines)
    464 
--> 465         return _read(filepath_or_buffer, kwds)
    466 
    467     parser_f.__name__ = name

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    249         return parser
    250 
--> 251     return parser.read()
    252 
    253 _parser_defaults = {

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
    708                 raise ValueError('skip_footer not supported for iteration')
    709 
--> 710         ret = self._engine.read(nrows)
    711 
    712         if self.options.get('as_recarray'):

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
   1157 
   1158         try:
--> 1159             data = self._reader.read(nrows)
   1160         except StopIteration:
   1161             if nrows is None:

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader.read (pandas/parser.c:7403)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7643)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._read_rows (pandas/parser.c:8265)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8139)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.raise_parser_error (pandas/parser.c:20776)()

CParserError: Error tokenizing data. C error: Expected 4 fields in line 12, saw 6

This seems odd to me. There's nothing in those extra fields to worry about anyway; they can just be dropped. Even if there were data there, a field that doesn't have a column to go with it should be dropped too. Lots of csv files are messy like this; that's basically why they're used all the time.

I am on pandas 0.15.2 and Python 3.

jreback commented 9 years ago

You need to explicitly allow this; otherwise it's designed to give an error.

In [8]: read_csv(StringIO(data),error_bad_lines=False)
Skipping line 12: expected 4 fields, saw 6

Out[8]: 
    DecisionM  IntelligentM  freq  total
0           0             5     9     20
1           0             6    21     33
2           0             7    35     65
3           0             8    35     83
4           0             9    14     41
5           0            10    10     26
6           1             5    11     20
7           1             6    12     33
8           1             7    30     65
9           1             8    48     83
10          1            10    16     26
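
(For reference, a self-contained version of the snippet above; the data string is abbreviated here. Note that in newer pandas the error_bad_lines keyword was deprecated in 1.3 and later removed in favour of on_bad_lines.)

from io import StringIO

import pandas as pd

data = """DecisionM,IntelligentM,freq,total
0, 5, 9, 20
1, 9, 27, 41,,
1, 10, 16, 26
"""

# 0.15-era API: skip malformed rows instead of raising CParserError
df = pd.read_csv(StringIO(data), error_bad_lines=False)

# equivalent on pandas >= 1.3:
# df = pd.read_csv(StringIO(data), on_bad_lines='skip')
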
szeitlin commented 8 years ago

when I try adding error_bad_lines=False to mine, I get this very verbose output:

b'Skipping line 6: expected 2 fields, saw 8\nSkipping line 7: expected 2 fields, saw 8\nSkipping line 8: expected 2 fields, saw 8\nSkipping line 9: expected 2 fields, saw 8\nSkipping line 10: expected 2 fields, saw 8\nSkipping line 11: expected 2 fields, saw 8\nSkipping line 12: expected 2 fields, saw 8\nSkipping line 13: expected 2 fields, saw 8\nSkipping line 14: expected 2 fields, saw 8\nSkipping line 15: expected 2 fields, saw 8\n' etc.

Looks like pandas is assuming the first line is the start of the data table, and is trying to tokenize based on the number of fields it found there, even though the entire rest of the file has 8 fields per row.

I know that the header on this file is such that if I do skiprows=4, pandas can read it just fine.

I don't think R has the ability to autodetect headers either, but it sure would be nice if pandas had an option to do it, or to let us specify the number of rows to look at when deciding what might be a header. Just an idea.
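
There is no built-in header autodetection, but a rough sketch of the idea (detect_header is a hypothetical helper, not a pandas API): scan a sample of lines, take the most common field count as the table width, and skip everything before the first line that matches it.

from collections import Counter

import pandas as pd

def detect_header(path, sep=",", sample=50):
    """Return the 0-based index of the first line whose field count
    matches the most common field count in the sample."""
    with open(path) as f:
        lines = [line for line in (f.readline() for _ in range(sample)) if line]
    # naive split: ignores quoting, fine for a first guess
    counts = [len(line.split(sep)) for line in lines]
    width = Counter(counts).most_common(1)[0][0]
    return counts.index(width)

# df = pd.read_csv("messy.csv", skiprows=detect_header("messy.csv"))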

TomAugspurger commented 8 years ago

Just my 2 cents, but I don't think pandas should try to guess here. Even though warning on every line past the first one is almost surely not what the user wants, it could also indicate a bad file, where a column wasn't written correctly.

Perhaps with another keyword or option to read_csv, but even then this seems like something that's tough to get right often enough to rely on. My preference is to put that burden on the user. Thoughts?

kawochen commented 8 years ago

Input files should follow RFC 4180 as closely as possible. The optional header should be the first line, so we shouldn't need to look past the first line to determine the header.

szeitlin commented 8 years ago

My point is that it's a common problem, and pandas could help alleviate the issue, which would save everyone time and duplicated effort.

Even if there were just an optional flag to return a list of tuples of (line number, number of fields in that line), that output would be more useful than what pandas is doing now, which actually already includes that information, just not in a usable format.
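
Something along those lines is easy to build outside pandas; a minimal sketch (messy.csv is a placeholder filename):

import csv

with open("messy.csv", newline="") as f:
    # record number == physical line number when no fields contain newlines
    field_counts = [(lineno, len(row))
                    for lineno, row in enumerate(csv.reader(f), start=1)]

# e.g. flag every row whose width differs from the header's
bad = [(n, c) for n, c in field_counts if c != field_counts[0][1]]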

jreback commented 8 years ago

@szeitlin there are a number of open issues w.r.t. how to report bad lines / warnings, e.g. see https://github.com/pydata/pandas/issues/5686

so maybe someone can propose an API, e.g. we could do something like:

# these replicate the current api
df = read_csv(......, errors='raise')
df = read_csv(......, errors='warn')

# new: also hand back the errors
df, errors = read_csv(...., return_errors=True)

and errors would be something like:

errors = [ { 'line' : 10, 'contents' : ..... , 'error' : ..... }, ..... ]

randomgambit commented 7 years ago

Something I was curious about is whether we can still retrieve these bogus cells in some way... My understanding is that error_bad_lines=False will just skip those lines. What about keeping them and adding some placeholder columns to hold them?
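
One workaround along those lines (a sketch, not a dedicated pandas feature; the extra1/extra2 names are made up): over-declare the column set with names=, so overflow fields land in spare columns instead of tripping the field-count check, while short rows are padded with NaN.

from io import StringIO

import pandas as pd

data = "a,b\n1,2\n3,4,5,6\n"

# skip the real header and supply a wider set of names ourselves
df = pd.read_csv(StringIO(data), skiprows=1, header=None,
                 names=["a", "b", "extra1", "extra2"])
# row 1 -> a=1, b=2, extra1=NaN, extra2=NaN
# row 2 -> a=3, b=4, extra1=5,   extra2=6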

szeitlin commented 7 years ago

So this was a while ago, but I ended up actually catching those errors and parsing them in order to figure out what was going on with my files. It worked OK, but it's probably kind of fragile (if anything about this part of pandas changes, it will break). I'd post the code here, but it was at my old job, so I don't have access to the repo anymore. 😂
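
Not the original code, but a sketch of the same idea on recent pandas (>= 1.4), where on_bad_lines accepts a callable (python engine only) that receives the parsed fields of each bad line, so you can record them instead of string-parsing warning output:

from io import StringIO

import pandas as pd

bad_lines = []

def keep_bad_line(fields):
    # stash the offending row's fields, then return None so pandas skips it
    bad_lines.append(fields)
    return None

data = "a,b\n1,2\n3,4,5,6\n"
df = pd.read_csv(StringIO(data), engine="python", on_bad_lines=keep_bad_line)
# bad_lines == [['3', '4', '5', '6']]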

randomgambit commented 7 years ago

Yeah, these bogus lines are unfortunately difficult to catch (at least in native pandas): http://stackoverflow.com/questions/38902553/error-tokenizing-data-during-pandas-read-csv-how-to-actually-see-the-bad-lines