pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DOC: clarify how read_csv nrows interacts with header and skiprows argument #59078

Open mdavis-xyz opened 1 week ago

mdavis-xyz commented 1 week ago

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Documentation problem

The documentation for read_csv's nrows argument says:

Number of rows of file to read. Useful for reading pieces of large files.

I want to read a file using header=1 and then limit the number of rows. The documentation says nrows counts rows of the file, which to me implies that it includes the skipped row and the column header row, since pandas still reads those rows from the file. My testing shows otherwise: nrows counts only data rows. It excludes the rows before the header and the column header row itself. Rows skipped with skiprows are likewise not counted towards nrows, and neither are rows that consist entirely of a comment.

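from io import StringIO

import pandas as pd

csv = """extra,
a,b
1,1
#comment,comment
2,2
3,3
footer,blah,yeah
"""

# header=1 takes the second line ("a,b") as the column names;
# the line before it ("extra,") is discarded.
with StringIO(csv) as buf:
    df = pd.read_csv(buf, header=1, nrows=2, comment='#')

print(df)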

With nrows=2, it seems to always return 2 data rows, regardless of the header, comment, and skipped lines.

Suggested fix for documentation

Number of data rows to read. Useful for reading pieces of large files. The following rows do not count towards nrows:

  • the column header
  • rows before the column header, if header=1 or larger
  • rows which are fully comment rows
  • rows skipped with skiprows
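
The exclusions listed above can be checked together with a short script. This is only a sketch: the sample data is made up for illustration, and it assumes the behaviour described in this issue (default parser settings, header counted relative to the lines remaining after skiprows).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: one line dropped by skiprows, one line above the
# header, the header itself, and a full-line comment between data rows.
data = (
    "junk line\n"
    "extra,\n"
    "a,b\n"
    "1,1\n"
    "#comment,comment\n"
    "2,2\n"
    "3,3\n"
    "4,4\n"
)

df = pd.read_csv(StringIO(data), skiprows=1, header=1, nrows=2, comment="#")

# If nrows counts only data rows, exactly two rows come back,
# and the comment line does not use up one of them.
print(df)
print(len(df))
```

If nrows counted physical lines of the file instead, the skipped line, the pre-header line, and the header would already exhaust the budget and the frame would come back empty.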
Aloqeely commented 1 week ago

Thanks for the suggestion. PRs are welcome!

mdavis-xyz commented 1 week ago

OK, I'll open a PR within the next week.

Note to myself: I need to test how rows with quoted newlines within string cells are counted.
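
One way the quoted-newline case could be checked is sketched below. The sample data is invented for the test, and the assumption being tested is that a quoted cell spanning two physical lines still counts as a single row towards nrows.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the first data row has a quoted cell containing "\n",
# so that one record spans two physical lines of the file.
data = 'a,b\n"x\ny",1\nz,2\nw,3\n'

df = pd.read_csv(StringIO(data), nrows=2)

# Two records are expected, the first with the newline kept inside the cell.
print(len(df))
print(repr(df.loc[0, "a"]))
```

This only covers the default parser settings; other engine or quoting options would need their own checks.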