read_csv with filehandler and nrows argument

mcocdawc commented 7 years ago

Code Sample, a copy-pastable example if possible

%%file example.csv
1,2
3,4
5,6
7,8
9,10
11,12

import pandas as pd

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2)
    print(f.readline())

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=1, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=3, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

Problem description

The Issue https://github.com/pandas-dev/pandas/issues/2071 is probably related. The c-parser exhaustes the file handler even if nrows is passed.

The python-parser shows unexpected behaviour, when nrows=1 or nrows=2 is given.

Expected Output

with open('example.csv') as f:
    data = pd.read_csv(f, names=['A', 'B'], nrows=2, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Linux OS-release: 4.10.0-27-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.20.3 pytest: 3.1.2 pip: 9.0.1 setuptools: 27.2.0 Cython: 0.25.2 numpy: 1.13.1 scipy: 0.19.1 xarray: None IPython: 6.1.0 sphinx: 1.6.2 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.3.0 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.0-b1 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.8.0 bs4: 4.6.0 html5lib: 0.999 sqlalchemy: 1.1.11 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None

gfyoung commented 7 years ago

I suspect the problem is that we aren't stopping the parsing for the C engine when we passing nrows. I think we try to respect it in some cases, but those are likely insufficient it seems.

As for the Python parser, that does bewilder me a bit. We are wrapping Python's csv reader class in most cases, so I suspect something is going with the interaction that causes it output the "weird" results that you are seeing.

PR to investigate and patch is welcome!

ewquon commented 6 years ago

Can confirm this problem still exists with version 0.21

akaihola commented 6 years ago

Can confirm this problem also still exists as of master branch on May 31, 2018:

commit 4274b840e64374a39a0285c2174968588753ec35
Date:   Thu May 31 19:14:33 2018 -0500

Output for all the tests in the original description are identical to the examples above.

akaihola commented 6 years ago

Interestingly, the behavior changes slightly for a CSV file which has headers:

%%file example_with_header.csv
A,B
1,2
3,4
5,6
7,8
9,10
11,12

with open('example_with_header.csv') as f:
    data = pd.read_csv(f, nrows=1, engine='python')
    print(f.readline())
    print(f.readline())
    print(data)

So the Python CSV parser skips one line less compared to a headerless file.

akaihola commented 6 years ago

I looked into the Python CSV parser by debugging how example_with_header.csv is parsed. I followed how the read pointer advances by calling .tell() for the file handle in the debugger.

In PythonParser.__init__(),

the call to self._infer_columns() consumes the header row
the call to self._get_index_name() consumes and buffers the first and second data rows, regardless of nrows
actual reading of data is done by concatenating the buffer with further lines read from the stream

Before the block which calls self._get_index_name(), there is this comment:

        # needs to be cleaned/refactored
        # multiple date column thing turning into a real spaghetti factory

This is what PythonParser._get_index_name() advertises itself to be about:

        Try several cases to get lines:

        0) There are headers on row 0 and row 1 and their
        total summed lengths equals the length of the next line.
        Treat row 0 as columns and row 1 as indices
        1) Look for implicit index: there are more columns
        on row 1 than row 0. If this is true, assume that row
        1 lists index columns and row 0 lists normal columns.
        2) Get index from the columns if it was listed.

This may be related to Index columns and trailing delimiters (link to Pandas IO Tools documentation).

akaihola commented 6 years ago

Also @mcocdawc, I don't believe #2071 is related, since it's about the C engine CSV parser failing on a file handle which has already been partially iterated over.

Specifically, @wesm comments:

The underlying problem is that the new parser relies on being able to call read on the file handle you pass. however, after iterating, this causes:
ValueError: Mixing iteration and read methods would lose data 

akaihola commented 6 years ago

01bd5a2 adds a test case which shows the behavior when parsing text files and using the nrows= option. The position of the input stream after parsing using the nrows= option is different between the C and Python parse engines. Also visible is the fact that the Python parse engine consumes at least two rows even if nrows=1.

akaihola commented 6 years ago

The C parser engine seems to read data in 256 kibibyte blocks, and doesn't seek back to actual data end even if nrows stops parsing early:

import pandas as pd
import os

csv_path = 'pandas/tests/io/sas/data/DEMO_G.csv'

csv_size_kib = os.path.getsize(csv_path) / 1024.
print(f'{csv_size_kib} KiB in {csv_path}')

with open(csv_path) as f:
    df = pd.read_csv(f, nrows=1)
    csv_pos_after_read = f.tell() / 1024.
    print(f'{csv_pos_after_read} KiB read from {csv_path}')

print(f'{df.shape} is the shape of data read from {csv_path}')

1303.5673828125 KiB in pandas/tests/io/sas/data/DEMO_G.csv
256.0 KiB read from pandas/tests/io/sas/data/DEMO_G.csv
(1, 48) is the shape of data read from pandas/tests/io/sas/data/DEMO_G.csv

The 256 KiB chunk size is hard-coded. See:

gfyoung commented 6 years ago

@akaihola : Thanks for this investigation! If you're able to put all of this analysis together into a PR, that would be great!

akaihola commented 6 years ago

I'm trying to think of solutions on the C parser side. Without knowing all the details of the parser, I wonder if the following would work:

for seekable streams (typically files):
- read in 256 KiB chunks as before to keep the performance
- when parsing, keep count of how many bytes have been consumed
- if parsing stops early (e.g. due to nrows), seek back to the correct position
for non-seekable streams (e.g. HTTP responses):
- read in 1 byte chunks

@gfyoung is this feasible at all? Do you think there would be a significant performance penalty from reading in one-byte chunks?

gfyoung commented 6 years ago

I like the files proposition, but I'm a little wary of the HTTP responses proposal. That being said, I can't argue with hard numbers. Give it a shot and if possible, benchmark it.

akaihola commented 6 years ago

Pandas' support for the "index columns in header" CSV format leads to this tricky example: two-header-rows The parser correctly detects that the first two rows have fewer columns than the third. Based on this, it assumes that

the second row indicates the names of the leading index columns (state, year)
the first row indicates the names of the trailing data columns (population, under18)

The catch here is that there are as many index columns as data columns (2 + 2 = 4). For that reason, it's not possible to avoid reading three rows before being able to decide whether the second row contains index column names or actual data.

Thus, it won't even be possible to make nrows=1 always behave "correctly", i.e. only consume only one row of data after headers, unless yet another new option is added to explicitly define whether a multi-row header is present.

(nrows=0 is an even more pathological case of course)

Edit: ...unless of course the index column detection logic is allowed to correct itself after the fact while parsing the first actual data row. But that would need a major overhaul of the parser code base, and would certainly complicate it considerably.

C parser support and documentation: Interestingly, the "index columns in header" feature isn't supported at all in the C parser – so actually there's more opportunity for fixing nrows=1 there as long as this feature is kept out.

Also, I don't think the feature is actually documented on the IO Tools page nor in read_csv() API documentation.

Note about docstring: Somehow the docstring for _get_index_name() confuses me. The explanation for "case 1" is either confusing or plain wrong:

Try several cases to get lines:

0) There are headers on row 0 and row 1 and their total summed lengths equals the length of the next line. Treat row 0 as columns and row 1 as indices 1) Look for implicit index: there are more columns on row 1 than row 0. If this is true, assume that row 1 lists index columns and row 0 lists normal columns. 2) Get index from the columns if it was listed.

The bolded part should in my opinion say "row 0 lists normal columns, and additional leading columns in data are treated as index columns".

Anyway, thanks @gfyoung for your comments. I'll continue working on this on spare time, and I'll try to come up with PRs for at least some of the discussed issues.

akaihola commented 6 years ago

I can see two alternatives for a new pd.read_csv() / pandas.io.parsers.PythonParser option which could prevent unintended consumption of more than nrows data rows:

row_index_header={None|False|True}
- None: current behavior (the default)
- might consume more than nrows data rows
- False: don't expect row index definitions in the header
- always interpret first row after column header as data
- prevents consuming more than nrows data rows
- True: expect row index definitions in the header
- always intrepret first row after column header as row index definitions
- prevents consuming more than nrows data rows
allow_lookahead={False|True}
- True: current behavior (the default)
- might consume more than nrows data rows
- False: disallow row index definitions in the header
- interpret first row after column header as data
- prevents consuming more than nrows data rows

igiloh commented 4 years ago

Are there any updates on the issue?

This problem still seems to exist in version 0.25.2.
When I read_csv() with a specific number of nrows - the returned DF is indeed nrows long, but the file pointer overshoots by a few thousands of lines - making it impossible to keep on reading the file from the desired point.

If there is no fix yet for this problem, maybe there is at least some kind of workaround?
The only option I've found was to switch to the python engine - but it's way too slow for large files.

andydish commented 2 years ago

@igiloh I have a likely related issue. I have a file that is essentially two separate CSV files smooshed into one separated by a blank row. That's all fine and dandy but limiting the function with nrows seems to not stop the function from interpreting the following CSV and 'smooshing' the two together despite nrows argument ending the reader before the second "csv" is reached.

igiloh commented 2 years ago

@andydish From what I could gather, the C implementation of read_csv() reads a chunk from the file into a buffer, than parses it and stops once nrows was reached. The user has no control over the chunk size - so the actual file pointer would move forward beyond the nrowsth line.

The workaround I ended up doing was:

with open(fname, 'rb') as file:
    def read_length(length):
        before = file.tell()
        data = pd.read_csv(file,
                           float_precision='high',
                           nrows=length).values
        file.seek(before)
        for i in range(length):
            next(file)
        return data

    first_part = read_length(len1)
    second_part = read_length(len2)

(not proud of it, but it worked...)

andydish commented 2 years ago

@igiloh That is where I was starting to head in my own solution. Thanks for sharing and saving me some time!

pandas-dev / pandas