kousu closed this issue 9 years ago
You need to explicitly allow this; otherwise it's designed to give an error.
In [8]: read_csv(StringIO(data),error_bad_lines=False)
Skipping line 12: expected 4 fields, saw 6
Out[8]:
DecisionM IntelligentM freq total
0 0 5 9 20
1 0 6 21 33
2 0 7 35 65
3 0 8 35 83
4 0 9 14 41
5 0 10 10 26
6 1 5 11 20
7 1 6 12 33
8 1 7 30 65
9 1 8 48 83
10 1 10 16 26
When I try adding error_bad_lines=False to mine, I get this very verbose output:
b'Skipping line 6: expected 2 fields, saw 8\nSkipping line 7: expected 2 fields, saw 8\nSkipping line 8: expected 2 fields, saw 8\nSkipping line 9: expected 2 fields, saw 8\nSkipping line 10: expected 2 fields, saw 8\nSkipping line 11: expected 2 fields, saw 8\nSkipping line 12: expected 2 fields, saw 8\nSkipping line 13: expected 2 fields, saw 8\nSkipping line 14: expected 2 fields, saw 8\nSkipping line 15: expected 2 fields, saw 8\
etc.
Looks like pandas assumes the first line is the start of the data table and tries to tokenize based on the number of fields it found there, even though the entire rest of the file has 8 fields per row.
I know that the header on this file is such that if I do skiprows=4, pandas can read it just fine.
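That skiprows workaround, as a self-contained sketch (the four-line preamble here is made up for illustration; the real file's header is not shown in this thread):

```python
import io
import pandas as pd

# a made-up file with a 4-line free-text preamble before the real table
raw = (
    "Report generated 2015-01-01\n"
    "Instrument: XYZ\n"
    "Operator: someone\n"
    "---\n"
    "sample,value\n"
    "a,1\n"
    "b,2\n"
)

# skiprows=4 jumps past the preamble, so every remaining row has 2 fields
df = pd.read_csv(io.StringIO(raw), skiprows=4)
print(df.shape)  # (2, 2)
```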
I don't think R has the ability to autodetect headers either, but it sure would be nice if pandas had an option to do it, or let us specify the number of rows to examine when deciding what might be a header. Just an idea.
Just my 2 cents, but I don't think pandas should try to guess here. Even though warning on every line past the first one is almost surely not what the user wants, it could also indicate a bad file, where a column wasn't written correctly.
Perhaps with another keyword or option to read_csv, but even then this seems like something that's tough to get right often enough to rely on. My preference is to put that burden on the user. Thoughts?
Input files should follow RFC4180 as closely as possible. The optional header should be the first line, so we shouldn't need to look past the first line to determine the header.
My point is that it's a common problem, and pandas could help alleviate the issue, which would save everyone time and duplicated effort.
Even an optional flag that returned a list of tuples of (line number, number of fields in that line) would be more useful than the current behavior, which already includes that information, just not in a usable format.
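That (line number, field count) output is straightforward to produce outside of pandas with the stdlib csv module; a minimal sketch of the shape being proposed:

```python
import csv
import io

def field_counts(text):
    # returns [(line_number, n_fields), ...] with 1-based line numbers
    return [(i, len(row))
            for i, row in enumerate(csv.reader(io.StringIO(text)), start=1)]

text = "a,b\n1,2\n1,2,3,4\n"
print(field_counts(text))  # [(1, 2), (2, 2), (3, 4)]
```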
@szeitlin there are a number of issues w.r.t. how to report bad lines / warnings, e.g. see https://github.com/pydata/pandas/issues/5686
so maybe someone can propose an API, e.g. we could do something like
# replicates the current API
df = read_csv(......, errors='raise')
df = read_csv(......, errors='warn')
df, errors = read_csv(...., return_errors=True)
and errors could be something like:
errors = [ { 'line' : 10, 'contents' : ..... ,'error' : ..... }, ..... ]
Something I was curious about is whether we can still retrieve these bogus cells in some way. My understanding is that error_bad_lines=False will just skip these lines. What about keeping them and adding some fake columns to hold them?
So this was a while ago, but I ended up actually catching those errors and parsing them, in order to figure out what was going on with my files. It worked ok, but it's probably kind of fragile (if anything about this part of pandas changes, it will break). I'd post the code here, but it was at my old job, so I don't have access to the repo anymore. 😂
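For anyone reconstructing that approach, a sketch along the same lines: capture the "Skipping line ..." text pandas emits and pull the numbers out with a regex. The exact message format is version-dependent, so this really is as fragile as described; the warning_text here is pasted from the output earlier in this thread:

```python
import re

warning_text = (
    "Skipping line 6: expected 2 fields, saw 8\n"
    "Skipping line 7: expected 2 fields, saw 8\n"
)

pattern = re.compile(r"Skipping line (\d+): expected (\d+) fields, saw (\d+)")
errors = [
    {"line": int(m.group(1)),
     "expected": int(m.group(2)),
     "saw": int(m.group(3))}
    for m in pattern.finditer(warning_text)
]
print(errors[0])  # {'line': 6, 'expected': 2, 'saw': 8}
```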
Yeah, these bogus lines are unfortunately difficult to catch (at least in native pandas): http://stackoverflow.com/questions/38902553/error-tokenizing-data-during-pandas-read-csv-how-to-actually-see-the-bad-lines
On this csv file:
pandas.read_csv()
gives me:

This seems odd to me. There's nothing in those fields to worry about anyway; they can just be dropped. Even if there were data there, if it doesn't have a column to go with it, it should be dropped too. Lots of CSV files are messy like this; they get used all the time anyway.
I am on pandas 0.15.2 and Python 3.