Open matteosantama opened 5 months ago
From the latest pyarrow documentation
newlines_in_values, optional (default False) Whether newline characters are allowed in CSV values. Setting this to True reduces the performance of multi-threaded CSV reading.
Enabling it by default would probably be a mistake. The pyarrow engine (with its multi-threaded capabilities) is the preferred option for large CSV files, though, so it'd be a shame for it to fail in this scenario.
If the pyarrow engine is here to stay, I'd recommend exposing newlines_in_values to the user.
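As a rough illustration of why the flag carries a performance cost, the standard-library sketch below shows that quoted values containing newlines cannot be split into rows on raw newline characters alone. It makes no claim about how pyarrow's chunker is implemented; the sample rows and variable names are made up for the example.

```python
import csv
import io

# Hypothetical sample data: every "text" value contains an embedded newline.
rows = [["text", "i"], ["ab\ncd", "0"], ["ef\ngh", "1"]]

# Serialize to CSV; the csv module quotes fields that contain newlines.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
data = buf.getvalue()

# Naive splitting treats every newline as a row boundary, so each
# data row is cut in two.
naive_lines = data.splitlines()

# A quote-aware parser has to track quoting state to find the real
# row boundaries, which is harder to do independently per chunk.
parsed = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_lines), "naive lines vs", len(parsed), "parsed rows")
```

Splitting on raw newlines is cheap to parallelize because any newline is a safe place to cut the input; once newlines may appear inside quoted values, a safe cut point depends on the quoting state that precedes it.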
To keep the pyarrow engine, you'll need to use the pyarrow library directly to handle CSV files that contain newline characters. This involves using the ParseOptions class from pyarrow.csv to set the newlines_in_values option to True.
import pandas as pd
from pyarrow import csv as pv

# Build a large frame whose text column contains a newline in every value
rows = []
for i in range(1_000_000):
    rows.append({"text": "ab\ncd", "i": i})
df = pd.DataFrame(rows)
# Write the frame to disk so there is a file to read back
df.to_csv("example.csv", index=False)
# Define parse options to allow newlines in values
parse_options = pv.ParseOptions(newlines_in_values=True)
# Read the CSV file using pyarrow
table = pv.read_csv("example.csv", parse_options=parse_options)
# Convert the Arrow Table to a Pandas DataFrame
df = table.to_pandas()
df
take
@WillAyd I would like to introduce a new argument to expose pyarrow's 'newlines_in_values' option to the user, because I cannot find a suitable one among the current parameters. Could you please suggest a name for the new parameter? Something like 'newlines_in_values' might also be usable by other engines in the future.
Reading through the issue I don't think we actually want to change anything here - the solution from @tilovashahrin should work.
Can you check if that works for you? If so, we should add a test for it to pandas (if one doesn't already exist) and maybe update the documentation to show how to do it
@WillAyd With some modifications, the code above works. I will add it as an example in the read_csv docs. I will also check the test cases and add one if it is not there. Thanks.
@WillAyd Hi, I opened a PR to resolve this issue. As part of it, I added one test case with pyarrow. Whenever I run it in my environment it passes: the exception is raised. However, the checks on the PR fail because the exception is NOT raised. Could you please help me? Thank you.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
pd.read_csv fails when reading large CSV files with engine="pyarrow" if values contain newline characters. The error is
Note the file must be large to trigger the error. Either pandas should enable this flag internally, or expose the option to the user.
Expected Behavior
Reading the file succeeds with engine="python" and I would expect consistency between the two options.
Installed Versions