pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

read_csv_batched not working when separator is included in the field #16953

Open MikeXydas opened 4 months ago

MikeXydas commented 4 months ago

Reproducible example

# error.csv
a,b
"test","test"
"test","test"
"test",",,"  # Notice here that we have 2 commas

# correct.csv
a,b
"test","test"
"test","test"
"test",","  # Here we only have 1 comma

# Python script
import polars as pl

reader_error = pl.read_csv_batched("error.csv", separator=",", batch_size=1, quote_char="\"")
batch = reader_error.next_batches(2)
print(len(batch))  # Prints 1, wrong
print(batch)

reader_correct = pl.read_csv_batched("correct.csv", separator=",", batch_size=1, quote_char="\"")
batch = reader_correct.next_batches(2)
print(len(batch))  # Prints 2, correct
print(batch)

Log output

1
[shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
│ test ┆ test │
│ test ┆ ,,   │
└──────┴──────┘]
2
[shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘, shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘]

Issue description

I am trying to read a relatively large CSV (30M+ rows) that cannot fit into memory, so I am using read_csv_batched. However, I noticed that reader.next_batches(5), instead of returning the requested number of batches (5 DataFrames in our case), always returned a single DataFrame containing all the rows, far more than the given batch size.

The issue seems to be triggered by the , character inside the field. Since the field is quoted with ", the comma should be treated as part of the value and should not affect the batch reader. Note that this is a minimal example: in the real scenario we had batch_size = 100,000 and the whole CSV was still read into a single DataFrame of 30M rows.

(Posted in SO first: https://stackoverflow.com/questions/78616907/polars-issue-with-read-csv-batched-when-separator-is-included-in-the-field)
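
As a cross-check of the quoting claim above, here is a minimal sketch that parses the contents of error.csv with Python's standard csv module instead of Polars. The helper name iter_batches is hypothetical and only for illustration; it is not part of Polars and says nothing about how read_csv_batched is implemented. It only shows that a standard CSV parser treats the quoted ",," as a single field, so batching by row count is well defined for this input.

```python
import csv
import io
from itertools import islice

# Contents of error.csv from the reproducible example (inline notes removed).
ERROR_CSV = 'a,b\n"test","test"\n"test","test"\n"test",",,"\n'


def iter_batches(text, batch_size, separator=",", quote_char='"'):
    """Yield lists of at most `batch_size` parsed rows (header excluded).

    Hypothetical helper for illustration; not a Polars API.
    """
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar=quote_char)
    next(reader)  # skip the header row
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(ERROR_CSV, batch_size=1))
print(len(batches))   # 3
print(batches[-1])    # [['test', ',,']]
```

Every row, including the one whose second field is ",,", comes back with exactly two fields, which suggests the file itself is well-formed and the single oversized batch comes from the batched reader.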

Expected behavior

The expected behavior is the one shown for correct.csv, where 2 batches of size 1 are returned:

[shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘, 
shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘]

Installed versions

```
--------Version info---------
Polars: 0.20.31
Index type: UInt32
Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake:
fastexcel:
fsspec: 2023.10.0
gevent:
hvplot:
matplotlib: 3.8.2
nest_asyncio: 1.6.0
numpy: 1.26.3
openpyxl:
pandas: 2.2.0
pyarrow: 15.0.0
pydantic: 2.6.0
pyiceberg:
pyxlsb:
sqlalchemy: 2.0.27
torch: 2.2.0+cu121
xlsx2csv:
xlsxwriter:
```

raayu83 commented 1 day ago

Also stumbled upon this today. Would be great if this could be fixed.