mcocdawc opened this issue 7 years ago
I suspect the problem is that we aren't stopping the parsing for the C engine when we pass nrows. I think we try to respect it in some cases, but those are likely insufficient, it seems.
As for the Python parser, that does bewilder me a bit. We are wrapping Python's csv reader class in most cases, so I suspect something is going on with that interaction that causes it to output the "weird" results that you are seeing.
PR to investigate and patch is welcome!
Can confirm this problem still exists with version 0.21
Can confirm this problem also still exists as of master branch on May 31, 2018:
commit 4274b840e64374a39a0285c2174968588753ec35
Date: Thu May 31 19:14:33 2018 -0500
Output for all the tests in the original description are identical to the examples above.
Interestingly, the behavior changes slightly for a CSV file which has headers:
```
%%file example_with_header.csv
A,B
1,2
3,4
5,6
7,8
9,10
11,12
```
```python
import pandas as pd

with open('example_with_header.csv') as f:
    data = pd.read_csv(f, nrows=1, engine='python')
    print(f.readline())
    print(f.readline())
print(data)
```
```
5,6
7,8
   A  B
0  1  2
```
So with a header present, the Python CSV parser skips one line fewer than it does for a headerless file.
I looked into the Python CSV parser by debugging how example_with_header.csv
is parsed. I followed how the read pointer advances by calling .tell()
for the file handle in the debugger.
In PythonParser.__init__(),
- self._infer_columns() consumes the header row, and
- self._get_index_name() consumes and buffers the first and second data rows, regardless of nrows.
Before the block which calls self._get_index_name(), there is this comment:

```python
# needs to be cleaned/refactored
# multiple date column thing turning into a real spaghetti factory
```
This is what PythonParser._get_index_name() advertises itself to be about:

```
Try several cases to get lines:
0) There are headers on row 0 and row 1 and their
total summed lengths equals the length of the next line.
Treat row 0 as columns and row 1 as indices
1) Look for implicit index: there are more columns
on row 1 than row 0. If this is true, assume that row
1 lists index columns and row 0 lists normal columns.
2) Get index from the columns if it was listed.
```
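As an aside (not part of the original discussion), case 1, the implicit index, is easy to reproduce; the column names and values below are invented for illustration:

```python
import io
import pandas as pd

# The header row has 3 names but the data rows have 4 fields, so the
# parser treats the extra leading column as the index (case 1 above).
csv_text = """A,B,C
foo,1,2,3
bar,4,5,6
"""
df = pd.read_csv(io.StringIO(csv_text), engine='python')
print(df)
#      A  B  C
# foo  1  2  3
# bar  4  5  6
```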
This may be related to "Index columns and trailing delimiters" in the pandas IO Tools documentation.
Also @mcocdawc, I don't believe #2071 is related, since it's about the C engine CSV parser failing on a file handle which has already been partially iterated over.
Specifically, @wesm comments:
The underlying problem is that the new parser relies on being able to call read on the file handle you pass. However, after iterating, this causes:

```
ValueError: Mixing iteration and read methods would lose data
```
01bd5a2 adds a test case which shows the behavior when parsing text files and using the nrows= option. The position of the input stream after parsing with nrows= differs between the C and Python parse engines. Also visible is the fact that the Python parse engine consumes at least two rows even when nrows=1.
The C parser engine seems to read data in 256-kibibyte blocks, and doesn't seek back to the actual end of the parsed data even when nrows stops parsing early:
```python
import pandas as pd
import os

csv_path = 'pandas/tests/io/sas/data/DEMO_G.csv'
csv_size_kib = os.path.getsize(csv_path) / 1024.
print(f'{csv_size_kib} KiB in {csv_path}')

with open(csv_path) as f:
    df = pd.read_csv(f, nrows=1)
    csv_pos_after_read = f.tell() / 1024.

print(f'{csv_pos_after_read} KiB read from {csv_path}')
print(f'{df.shape} is the shape of data read from {csv_path}')
```

```
1303.5673828125 KiB in pandas/tests/io/sas/data/DEMO_G.csv
256.0 KiB read from pandas/tests/io/sas/data/DEMO_G.csv
(1, 48) is the shape of data read from pandas/tests/io/sas/data/DEMO_G.csv
```
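A rough way to watch those raw read() calls directly (not from the thread, and only a sketch: the exact set of methods pandas expects from a file-like object varies between versions) is to hand read_csv a thin wrapper around the open handle:

```python
import pandas as pd

class ReadLogger:
    """Delegating file-like wrapper that logs every read() call."""
    def __init__(self, handle):
        self._handle = handle

    def read(self, size=-1):
        data = self._handle.read(size)
        print(f'read({size}) returned {len(data)} characters, '
              f'stream position now {self._handle.tell()}')
        return data

    def __getattr__(self, name):
        # Everything else (readline, seek, tell, ...) goes to the real handle.
        return getattr(self._handle, name)

with open('pandas/tests/io/sas/data/DEMO_G.csv') as f:
    pd.read_csv(ReadLogger(f), nrows=1)
```

If the block reading described above is what happens, a single large read should show up even for nrows=1.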
The 256 KiB chunk size is hard-coded. See:
@akaihola : Thanks for this investigation! If you're able to put all of this analysis together into a PR, that would be great!
I'm trying to think of solutions on the C parser side. Without knowing all the details of the parser, I wonder if the following would work:
- for seekable inputs such as plain files, keep reading large blocks, but after parsing the requested number of rows (nrows), seek back to the correct position
- for non-seekable inputs such as HTTP responses, read in very small (even one-byte) chunks so that the stream is never consumed past the parsed rows

@gfyoung is this feasible at all? Do you think there would be a significant performance penalty from reading in one-byte chunks?
I like the files proposition, but I'm a little wary of the HTTP responses proposal. That being said, I can't argue with hard numbers. Give it a shot and if possible, benchmark it.
Pandas' support for the "index columns in header" CSV format leads to a tricky case (an illustrative file is sketched below). The parser correctly detects that the first two rows have fewer columns than the third. Based on this, it assumes that
- one header row lists the index columns (state, year), and
- the other header row lists the data columns (population, under18).

The catch here is that there are as many index columns as data columns (2 + 2 = 4). For that reason, it's not possible to avoid reading three rows before being able to decide whether the second row contains index column names or actual data.
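The example file itself isn't preserved above; a file of roughly this shape (the names come from the comment, the values are invented) shows the situation, with two short header rows followed by wider data rows in which the index values lead:

```python
import io
import pandas as pd

# Rows 0 and 1 have two fields each; the data rows have 2 + 2 = 4 fields.
# Until the third line has been read, "state,year" could just as well be
# the first data row of an ordinary single-header-row file.
csv_text = """population,under18
state,year
California,2000,100,10
Texas,2010,200,20
"""
# Case 0 of _get_index_name() should treat row 0 as the columns and
# row 1 as the index names (python engine only).
df = pd.read_csv(io.StringIO(csv_text), engine='python')
print(df)
```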
Thus, it won't even be possible to make nrows=1 always behave "correctly", i.e. consume only one row of data after the headers, unless yet another new option is added to explicitly define whether a multi-row header is present.
(nrows=0 is an even more pathological case, of course.)
Edit: ...unless of course the index column detection logic is allowed to correct itself after the fact while parsing the first actual data row. But that would need a major overhaul of the parser code base, and would certainly complicate it considerably.
C parser support and documentation: Interestingly, the "index columns in header" feature isn't supported at all in the C parser, so actually there's more opportunity for fixing nrows=1 there, as long as this feature is kept out.
Also, I don't think the feature is actually documented on the IO Tools page or in the read_csv() API documentation.
Note about docstring: Somehow the docstring for _get_index_name() confuses me. The explanation for "case 1" is either confusing or plain wrong:
Try several cases to get lines:
0) There are headers on row 0 and row 1 and their total summed lengths equals the length of the next line. Treat row 0 as columns and row 1 as indices
1) Look for implicit index: there are more columns on row 1 than row 0. If this is true, **assume that row 1 lists index columns and row 0 lists normal columns.**
2) Get index from the columns if it was listed.

The bolded part should in my opinion say "row 0 lists normal columns, and additional leading columns in data are treated as index columns".
Anyway, thanks @gfyoung for your comments. I'll continue working on this in my spare time, and I'll try to come up with PRs for at least some of the discussed issues.
I can see two alternatives for a new pd.read_csv() / pandas.io.parsers.PythonParser option which could prevent unintended consumption of more than nrows data rows:

row_index_header={None|False|True}
- None: current behavior (the default); may consume more than nrows data rows
- False: don't expect row index definitions in the header; consumes at most nrows data rows
- True: expect row index definitions in the header; consumes at most nrows data rows

allow_lookahead={False|True}
- True: current behavior (the default); may consume more than nrows data rows
- False: disallow row index definitions in the header; consumes at most nrows data rows
Are there any updates on the issue?
This problem still seems to exist in version 0.25.2.
When I read_csv() with a specific nrows, the returned DataFrame is indeed nrows rows long, but the file pointer overshoots by a few thousand lines, making it impossible to keep on reading the file from the desired point.
If there is no fix yet for this problem, maybe there is at least some kind of workaround?
The only option I've found was to switch to the python engine, but it's way too slow for large files.
@igiloh I have a likely related issue. I have a file that is essentially two separate CSV files smooshed into one, separated by a blank row. That's all fine and dandy, but limiting the function with nrows doesn't seem to stop it from interpreting the following CSV and 'smooshing' the two together, even though the nrows argument ends the reader before the second "csv" is reached.
@andydish From what I could gather, the C implementation of read_csv() reads a chunk from the file into a buffer, then parses it and stops once nrows is reached. The user has no control over the chunk size, so the actual file pointer ends up beyond the nrows-th line.
The workaround I ended up doing was:
```python
with open(fname, 'rb') as file:

    def read_length(length):
        # Remember where this part of the data actually starts.
        before = file.tell()
        data = pd.read_csv(file,
                           float_precision='high',
                           nrows=length).values
        # Rewind past read_csv's overshoot, then advance the raw handle
        # by exactly `length` lines so the next call starts where the
        # parsed data ended.
        file.seek(before)
        for i in range(length):
            next(file)
        return data

    first_part = read_length(len1)
    second_part = read_length(len2)
```
(not proud of it, but it worked...)
@igiloh That is where I was starting to head in my own solution. Thanks for sharing and saving me some time!
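A side note not raised in the thread: when the goal is only to read one file in consecutive pieces with a single schema, pandas' own chunked reading sidesteps the file-pointer problem, because the same parser object keeps its buffer between calls. It would not help with the two-CSVs-in-one-file case above, since one reader assumes one header. A minimal sketch (the file name and chunk sizes are placeholders):

```python
import pandas as pd

# One TextFileReader; its internal buffer, not the raw file position,
# determines where the next chunk starts.
reader = pd.read_csv('big_file.csv', iterator=True)
first_part = reader.get_chunk(1000)    # rows 0-999
second_part = reader.get_chunk(500)    # rows 1000-1499
reader.close()
```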
Code Sample, a copy-pastable example if possible
Problem description
The issue https://github.com/pandas-dev/pandas/issues/2071 is probably related. The C parser exhausts the file handle even if nrows is passed.
The Python parser shows unexpected behaviour when nrows=1 or nrows=2 is given.
Expected Output
Output of pd.show_versions()