pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.92k stars 18.03k forks source link

BUG: Reading large CSV files with `pyarrow` when values contain newline character. #59009

Open matteosantama opened 5 months ago

matteosantama commented 5 months ago

Pandas version checks

Reproducible Example

import pandas as pd

rows = []
for i in range(1_000_000):
    rows.append({"text": "ab\ncd", "i": i})

df = pd.DataFrame(rows)
df.to_csv("./example.csv", index=False)
pd.read_csv("./example.csv", engine="pyarrow")

Issue Description

pd.read_csv fails when reading large CSV files with engine="payarrow" if values contain newline characters. The error is

ParserError: CSV parser got out of sync with chunker. This can mean the 
data file contains cell values spanning multiple lines; please consider 
enabling the option 'newlines_in_values'.

Note the file must be large to trigger the error. Either pandas should enable this flag internally, or expose the option to the user.

Expected Behavior

Reading the file succeeds with engine="python" and I would expect consistency between the two options.

Installed Versions

In [7]: pd.show_versions() INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.11.9.final.0 python-bits : 64 OS : Darwin OS-release : 23.5.0 Version : Darwin Kernel Version 23.5.0: Wed May 1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.2 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : None pip : 24.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : 8.25.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 16.1.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None
matteosantama commented 5 months ago

From the latest pyarrow documentation

newlines_in_values, optional (default False) Whether newline characters are allowed in CSV values. Setting this to True reduces the performance of multi-threaded CSV reading.

Enabling it by default would probably be a mistake. The pyarrow engine (with its multi-threaded capabilities) is the preferred option for large CSV files, though, so it'd be a shame for it to fail in this scenario.

If the pyarrow engine is here to stay, I'd recommend exposing newlines_in_values to the user.

tilovashahrin commented 5 months ago

To keep the pyarrow engine, you'll need to use the pyarrow library directly to handle CSV files that contain newline characters. This involves using the ParseOptions class from pyarrow.csv to set the newlines_in_values option to True.

Example

import pyarrow as pa
import pandas as pd

rows = []
for i in range(1_000_000):
    rows.append({"text": "ab\ncd", "i": i})

df = pd.DataFrame(rows)
# Define parse options to allow newlines in values
parse_options = pv.ParseOptions(newlines_in_values=True)

# Read the CSV file using pyarrow
table = pv.read_csv("example.csv", parse_options=parse_options)

# Convert the Arrow Table to a Pandas DataFrame
df = table.to_pandas()
df
gosuchoi commented 4 months ago

take

wooseogchoi commented 3 months ago

take

wooseogchoi commented 3 months ago

take

wooseogchoi commented 3 months ago

@WillAyd I would like to introduce a new argument in order to expose pyarrow's 'newlines_in_values' to the user because I cannot find any suitable in the current parameters. Could you please suggest new parametrer name for this, 'newlines_in_values' which might be used by another engines in the future.

WillAyd commented 3 months ago

Reading through the issue I don't think we actually want to change anything here - the solution from @tilovashahrin should work.

Can you check if that works for you? If so, we should add a test for it to pandas (if one doesn't already exist) and maybe update the documentation to show how to do it

wooseogchoi commented 3 months ago

@WillAyd With some modification, the codes above are working. I will add it as example in the read_csv doc. Also I will check the test cases. If it is not there, I will add one. Thx

wooseogchoi commented 2 months ago

@WillAyd Hi, I set up PR to resolve this issue. As part of this, I added one test case with pyarrow. Whenever I ran it in my environment, it always passed. The exception raised. However, when I uploaded it to PR, the checks in PR failed due to NOT raising exception. Can you please help me? thank you

wooseogchoi commented 2 months ago

https://github.com/pandas-dev/pandas/pull/59754