pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

In read_csv function, if use_pyarrow=True then comment_prefix is ignored #19610

Open biiiipy opened 1 day ago

biiiipy commented 1 day ago

Reproducible example

test.csv:

column
aaaaaaaa
#comment

This returns the comment row starting with '#', but it shouldn't:

import polars as pl

pl.read_csv('test.csv', use_pyarrow=True, comment_prefix='#')

Log output

 ┌────────────────────────────┐
 │ column                     │
 │ ---                        │
 │ str                        │
 ╞════════════════════════════╡
 │ aaaaaaaa                   │
 │ #comment                   │
 └────────────────────────────┘

Issue description

pyarrow doesn't have an option to define comment rows and skip them, which complicates a fix.

Expected behavior

read_csv should not return the #comment row. It should either warn or raise an error when both use_pyarrow=True and comment_prefix are passed, or remove the comment rows from the dataframe as an additional step.
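Until one of those options exists, one possible workaround is to drop the comment lines before the data reaches the reader. This is a minimal sketch, not polars code: `strip_comment_lines` is a hypothetical helper, and it assumes UTF-8 input with no quoted multi-line fields whose continuation lines begin with the prefix.

```python
import io


def strip_comment_lines(raw: bytes, comment_prefix: str = "#") -> io.BytesIO:
    """Return an in-memory buffer without the lines that start with the prefix.

    Hypothetical helper for illustration only; it is not part of polars.
    """
    prefix = comment_prefix.encode("utf-8")
    # Keep line endings so the remaining CSV content is byte-for-byte unchanged.
    kept = [
        line
        for line in raw.splitlines(keepends=True)
        if not line.startswith(prefix)
    ]
    return io.BytesIO(b"".join(kept))


# Same content as test.csv from the reproducible example.
sample = b"column\naaaaaaaa\n#comment\n"
cleaned = strip_comment_lines(sample)
print(cleaned.getvalue().decode("utf-8"))
```

The resulting buffer could then be handed to `pl.read_csv(cleaned, use_pyarrow=True)` in place of the file path, assuming file-like input is acceptable for the use case.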

Installed versions

```
--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio         1.5.7
numpy                1.26.4
openpyxl
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg
sqlalchemy           2.0.35
torch
xlsx2csv
xlsxwriter
```

ritchie46 commented 1 day ago

Pyarrow doesn't support a comment character, only an invalid row handler callback: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions

We could create that callable for pyarrow, or we could raise an exception saying that we don't support that combination with pyarrow. Given that the callback makes pyarrow call back into Python, this will have terrible performance and is not something we would normally accept. I think we should raise.
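For illustration, the callable option might look roughly like the sketch below. `make_comment_handler` is a hypothetical name. It relies on the `text` attribute of pyarrow's `InvalidRow` and on the handler returning `"skip"` or `"error"`. As far as the pyarrow documentation describes it, the handler only runs for rows with a mismatching number of columns, so a comment row with the expected column count (like the one in `test.csv`) would not reach it.

```python
from types import SimpleNamespace


def make_comment_handler(comment_prefix: str):
    """Build a callable suitable for ParseOptions(invalid_row_handler=...).

    Hypothetical helper for illustration only; it is not part of polars.
    """

    def handler(row) -> str:
        # `row` is expected to be a pyarrow.csv.InvalidRow; `row.text` is the raw line.
        return "skip" if row.text.startswith(comment_prefix) else "error"

    return handler


handler = make_comment_handler("#")

# SimpleNamespace stands in for InvalidRow so the sketch runs without pyarrow.
print(handler(SimpleNamespace(text="#comment")))  # skip
print(handler(SimpleNamespace(text="a,b,c")))     # error
```

With pyarrow installed, the handler would be passed as `pyarrow.csv.ParseOptions(invalid_row_handler=handler)`.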

cmdlineluser commented 23 hours ago

For reference, pandas also raises:

pd.read_csv("", comment="#", engine="pyarrow")
# ValueError: The 'comment' option is not supported with the 'pyarrow' engine
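A comparable guard for polars could be as small as the sketch below. The function name and the error message are assumptions made for illustration; they are not taken from the polars code base.

```python
from typing import Optional


def check_pyarrow_csv_args(use_pyarrow: bool, comment_prefix: Optional[str]) -> None:
    """Reject the unsupported combination before any parsing starts.

    Hypothetical helper for illustration only; it is not part of polars.
    """
    if use_pyarrow and comment_prefix is not None:
        msg = "`comment_prefix` is not supported when `use_pyarrow=True`"
        raise ValueError(msg)


# The native reader path is unaffected.
check_pyarrow_csv_args(use_pyarrow=False, comment_prefix="#")

# The combination from this issue is rejected up front.
try:
    check_pyarrow_csv_args(use_pyarrow=True, comment_prefix="#")
except ValueError as exc:
    print(f"ValueError: {exc}")
```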