pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Regression in 1.10.0: ComputeError when parsing quoted string #19432

Open mihai-afternet opened 1 day ago

mihai-afternet commented 1 day ago

Reproducible example

Fails:

import polars as pl
import io

data = '''Name
"test test" test
another name
'''

df = pl.read_csv(io.StringIO(data))

Works:

import polars as pl
import io

data = '''Name
"test test" test
another name
'''

df = pl.read_csv(io.StringIO(data), quote_char=None)

Or:

import polars as pl
import io

df = pl.DataFrame({
    "name": ['"test test" test']
})

csv_buffer = io.StringIO()
df.write_csv(csv_buffer)
csv_buffer.seek(0)

df_from_csv = pl.read_csv(csv_buffer)

Log output

pydf = PyDataFrame.read_csv(
           ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not parse `"test test" test` as dtype `str` at column 'Name' (column number 1)

The current offset in the file is 5 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `"test test" test` to the `null_values` list.

Original error:  csv file

Field `"test test" test` is not properly escaped.

Issue description

I encountered a ComputeError when attempting to parse a string in a CSV column using Polars version 1.10.0. The string in question is "test test" test, which should be parsed as a valid string. This issue did not occur in previous versions of Polars, making it a regression introduced in 1.10.0.

I can use the quote_char=None parameter to overcome the issue.

Expected behavior

The string <"test test" test> should be successfully parsed as a valid string in the 'Name' column without raising an error.

Installed versions

```
--------Version info---------
Polars: 1.11.0
Index type: UInt32
Platform: Windows-10-10.0.19045-SP0
Python: 3.12.0 (tags/v3.12.0:0fb18b0, Oct 2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
LTS CPU: False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec 2024.9.0
gevent
great_tables
matplotlib
nest_asyncio 1.6.0
numpy 2.1.1
openpyxl
pandas 2.2.2
pyarrow
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

coastalwhite commented 1 day ago

This was caused by #19124. @ritchie46 can you have a look?

ritchie46 commented 19 hours ago

It's actually a correct error. The value is incorrectly escaped. It should be """test test"" test", enclosing the entire field in " and doubling (escaping) internal ".
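
To make the escaping rule above concrete, here is a minimal sketch that uses Python's standard-library `csv` module (not Polars) to produce the escaped form of the value. The helper name `escape_csv_field` is made up for this illustration.

```python
import csv
import io


def escape_csv_field(value: str) -> str:
    """Return `value` as it would appear as a single field in a CSV file.

    Hypothetical helper for illustration; it relies on csv.writer's default
    dialect, which wraps the field in double quotes and doubles any
    embedded double quote when quoting is needed.
    """
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([value])
    # Drop the single line terminator that writerow appended.
    return buffer.getvalue()[:-1]


raw = '"test test" test'
escaped = escape_csv_field(raw)
print(escaped)  # """test test"" test"

# Reading the escaped text back yields the original value.
parsed = next(csv.reader(io.StringIO(escaped)))
```

The output matches the form given in the comment above: the whole field is wrapped in `"` and each internal `"` is doubled.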

Previously we read it incorrectly, so this was a bug fix.

Filimoa commented 9 hours ago

Is there a workaround? I imagine many people don't have control over the underlying data so this makes it impossible to read certain datasets with polars.

ritchie46 commented 8 hours ago

Yes. Set a different quote character if the data isn't quoted properly. The error gives a few tips.

Reading it in with the quote char set to " isn't an option. It's invalid CSV.
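
Besides passing `quote_char=None` as in the original report, another possible approach for data you don't control is to rewrite it into validly escaped CSV before handing it to a reader. The following is a rough sketch using only Python's standard-library `csv` module; `requote_csv` is a hypothetical helper, and it assumes that no field in the input contains the delimiter or a newline, since quotes are treated as plain characters while reading.

```python
import csv
import io


def requote_csv(text: str, delimiter: str = ",") -> str:
    """Rewrite loosely quoted CSV text so every field is validly escaped.

    Hypothetical helper for illustration. Quotes in the input are read as
    literal characters (QUOTE_NONE), so this only works when no field
    contains the delimiter or a line break.
    """
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE
    )
    out = io.StringIO()
    # The default writer dialect quotes fields that need it and doubles
    # any embedded double quotes.
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


data = 'Name\n"test test" test\nanother name\n'
fixed = requote_csv(data)
print(fixed)

# A standard CSV reader now recovers the original values.
rows = list(csv.reader(io.StringIO(fixed)))
```

The rewritten text contains `"""test test"" test"` on the second line, which is the escaped form described earlier in the thread, so it should be readable with the default quote character.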