pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars raising `PanicException` from `pl.read_csv` when trying to parse a string with escaped `"` at the start and `”` at the end #11308

Open pall-j opened 10 months ago

pall-j commented 10 months ago

Reproducible example

Python 3.10.10 Polars 0.19.3

from io import StringIO
import polars as pl

csv = StringIO('''
str_col
"ok"
"""error”"
ok
''')

df = pl.read_csv(csv, dtypes={'str_col': pl.Utf8}, try_parse_dates=True)
print(df)

Log output

With RUST_BACKTRACE=full

thread '<unnamed>' panicked at /Users/runner/work/polars/polars/crates/polars-io/src/csv/utils.rs:112:63:
byte index 9 is not a char boundary; it is inside '”' (bytes 7..10) of `""error”`
stack backtrace:
   0:        0x112038380 - _ffi_select_with_compiled_path
   1:        0x110c278ec - _BrotliDecoderVersion
   2:        0x11201456c - _ffi_select_with_compiled_path
   3:        0x11203c2b8 - _ffi_select_with_compiled_path
   4:        0x11203b908 - _ffi_select_with_compiled_path
   5:        0x11203c8e0 - _ffi_select_with_compiled_path
   6:        0x11203c440 - _ffi_select_with_compiled_path
   7:        0x11203c3b4 - _ffi_select_with_compiled_path
   8:        0x11203c3a8 - _ffi_select_with_compiled_path
   9:        0x112197edc - _ffi_select_with_compiled_path
  10:        0x110c2d9b8 - _BrotliDecoderVersion
  11:        0x112198388 - _ffi_select_with_compiled_path
  12:        0x1116437f0 - _ffi_select_with_compiled_path
  13:        0x11163e328 - _ffi_select_with_compiled_path
  14:        0x1116391fc - _ffi_select_with_compiled_path
  15:        0x1105a0b28 - <unknown>
  16:        0x110716268 - <unknown>
  17:        0x11072ceec - <unknown>
  18:        0x1102d13b4 - <unknown>
  19:        0x1108611dc - _PyInit_polars
  20:        0x1006f23c4 - _cfunction_call
  21:        0x1006a9d74 - __PyObject_Call
  22:        0x10078a110 - __PyEval_EvalFrameDefault
  23:        0x101246a84 - ___pyx_f_18_pydevd_frame_eval_36pydevd_frame_evaluator_darwin_310_64_get_bytecode_while_frame_eval
  24:        0x1007839e0 - __PyEval_Vector
  25:        0x1006abe94 - _method_vectorcall
  26:        0x1006a9bdc - _PyVectorcall_Call
  27:        0x10078a0f8 - __PyEval_EvalFrameDefault
  28:        0x101246a84 - ___pyx_f_18_pydevd_frame_eval_36pydevd_frame_evaluator_darwin_310_64_get_bytecode_while_frame_eval
  29:        0x1007839e0 - __PyEval_Vector
  30:        0x10078c7f4 - _call_function
  31:        0x100789ed0 - __PyEval_EvalFrameDefault
  32:        0x101246a84 - ___pyx_f_18_pydevd_frame_eval_36pydevd_frame_evaluator_darwin_310_64_get_bytecode_while_frame_eval
  33:        0x1007839e0 - __PyEval_Vector
  34:        0x100783928 - _PyEval_EvalCode
  35:        0x1007802e4 - _builtin_exec
  36:        0x1006f1b3c - _cfunction_vectorcall_FASTCALL
  37:        0x10078c7f4 - _call_function
  38:        0x100789e58 - __PyEval_EvalFrameDefault
  39:        0x101246a84 - ___pyx_f_18_pydevd_frame_eval_36pydevd_frame_evaluator_darwin_310_64_get_bytecode_while_frame_eval
  40:        0x1007839e0 - __PyEval_Vector
  41:        0x10078c7f4 - _call_function
  42:        0x100789dd8 - __PyEval_EvalFrameDefault
  43:        0x101246a84 - ___pyx_f_18_pydevd_frame_eval_36pydevd_frame_evaluator_darwin_310_64_get_bytecode_while_frame_eval
  44:        0x1007839e0 - __PyEval_Vector
  45:        0x10078c7f4 - _call_function
  46:        0x100789da8 - __PyEval_EvalFrameDefault
  47:        0x1007839e0 - __PyEval_Vector
  48:        0x10078c7f4 - _call_function
  49:        0x100789da8 - __PyEval_EvalFrameDefault
  50:        0x1007839e0 - __PyEval_Vector
  51:        0x10078c7f4 - _call_function
  52:        0x100789e58 - __PyEval_EvalFrameDefault
  53:        0x1007839e0 - __PyEval_Vector
  54:        0x100783928 - _PyEval_EvalCode
  55:        0x1007d0b84 - __PyRun_SimpleFileObject
  56:        0x1007d05b4 - __PyRun_AnyFileObject
  57:        0x1007efabc - _Py_RunMain
  58:        0x1007eff2c - _pymain_main
  59:        0x1007effcc - _Py_BytesMain
Traceback (most recent call last):
  File "/Users/juraj/.pyenv/versions/3.10.10/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/juraj/Library/Caches/pypoetry/virtualenvs/tmp-yttreUDk-py3.10/lib/python3.10/site-packages/polars/io/_utils.py", line 83, in managed_file
    yield file
  File "/Users/juraj/Library/Caches/pypoetry/virtualenvs/tmp-yttreUDk-py3.10/lib/python3.10/site-packages/polars/io/csv/functions.py", line 364, in read_csv
    df = pl.DataFrame._read_csv(
  File "/Users/juraj/Library/Caches/pypoetry/virtualenvs/tmp-yttreUDk-py3.10/lib/python3.10/site-packages/polars/dataframe/frame.py", line 768, in _read_csv
    self._df = PyDataFrame.read_csv(
pyo3_runtime.PanicException: byte index 9 is not a char boundary; it is inside '”' (bytes 7..10) of `""error”`

Issue description

If try_parse_dates is set to False in the provided example, or if schema is used instead of dtypes, the data loads correctly without any problem. This is the behavior I would expect in the example case as well.

There seems to be an internal bug in Polars, since an unhandled PanicException is raised.
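
The panic message describes a slice taken at a byte offset that falls inside a multi-byte UTF-8 character. The following is a minimal sketch in plain Python, not the Polars code, and the helper name `char_byte_spans` is hypothetical. It shows why byte index 9 of `""error”` is not a character boundary:

```python
def char_byte_spans(text: str) -> list[tuple[str, int, int]]:
    """Return (char, start_byte, end_byte) for each character of text in UTF-8."""
    spans = []
    offset = 0
    for ch in text:
        width = len(ch.encode("utf-8"))
        spans.append((ch, offset, offset + width))
        offset += width
    return spans


# The field content quoted in the panic message; \u201d is the ” character.
field = '""error\u201d'
spans = char_byte_spans(field)
encoded = field.encode("utf-8")

# The last character spans bytes 7..10, so offset 9 lies inside it.
print(spans[-1])

try:
    # Cutting the byte string at offset 9 splits the three-byte character.
    encoded[:9].decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot split at byte 9:", exc.reason)
```

Python reports a decode error here, while Rust string slicing panics in the same situation, which would match the message in the log above.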

Expected behavior

┌─────────┐
│ str_col │
│ ---     │
│ str     │
╞═════════╡
│ str_col │
│ ok      │
│ "error” │
│ ok      │
└─────────┘

Installed versions

```
--------Version info---------
Polars: 0.19.3
Index type: UInt32
Platform: macOS-13.3-arm64-arm-64bit
Python: 3.10.10 (main, Mar 24 2023, 12:03:20) [Clang 14.0.0 (clang-1400.0.29.202)]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib:
numpy:
pandas:
pyarrow:
pydantic:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

mcrumiller commented 10 months ago

Can you create a minimal working example? For CSV files, it's useful to use StringIO for this:

from io import StringIO

import polars as pl

csv = StringIO("""
name,age
Frank,30
Michelle,58
Peter,78
""")
df = pl.read_csv(csv)
print(df)
shape: (3, 2)
┌──────────┬─────┐
│ name     ┆ age │
│ ---      ┆ --- │
│ str      ┆ i64 │
╞══════════╪═════╡
│ Frank    ┆ 30  │
│ Michelle ┆ 58  │
│ Peter    ┆ 78  │
└──────────┴─────┘
mcrumiller commented 10 months ago

Also, looking closely at the CSV text you provided, you have some non-ASCII quotes in there:

"""error”"

This is not properly escaped: you have 3 double quotes (Unicode 34), followed by a double end-quote (which is Unicode 8221) and then a single double-quote.
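
The code points in that line can be listed directly. This is a small sketch using only the Python standard library, and the variable names are illustrative:

```python
import unicodedata

# The raw CSV line from the example; \u201d is the ” character.
raw_field = '"""error\u201d"'

# Collect the code point and Unicode name of every non-letter character.
code_points = [(ord(ch), unicodedata.name(ch)) for ch in raw_field if not ch.isalpha()]

for cp, name in code_points:
    print(cp, name)
```

This prints three entries for code point 34 (QUOTATION MARK), one for 8221 (RIGHT DOUBLE QUOTATION MARK), and a final 34.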

pall-j commented 10 months ago

> Also, looking closely at the CSV text you provided, you have some non-ASCII quotes in there:
>
> """error”"
>
> This is not properly escaped, you have 3 double quotes (Unicode 34), followed by a double end-quote (which is Unicode 8221) and then a single double-quote.

The first and last double quotes enclose the string; the second and third double quotes together form a single escaped double quote. The Unicode 8221 character should be kept as is, without any escaping (it is an intentional non-ASCII character).

Thus the string should be parsed as "error” without any issues. As I wrote, this works correctly when try_parse_dates is set to False or when schema is used instead of dtypes.
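
The quoting rule can be checked against Python's standard `csv` module, which treats a doubled quote inside a quoted field as one literal quote by default. This is only a sketch with a different parser and says nothing about how Polars parses internally:

```python
import csv
from io import StringIO

# Same rows as the reproducible example, without the leading blank line;
# \u201d is the ” character.
data = StringIO('str_col\n"ok"\n"""error\u201d"\nok\n')

# Each row has one field; the doubled quote collapses to a single ".
rows = [row[0] for row in csv.reader(data)]
print(rows)
```

The third value comes back as `"error”`: one ASCII double quote, the word, and the unchanged U+201D character.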

pall-j commented 10 months ago

> Can you create a minimal working example? For CSV files, it's useful to use StringIO for this:


Updated