[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
# error.csv
a,b
"test","test"
"test","test"
"test",",," # Notice here that we have 2 commas
# correct.csv
a,b
"test","test"
"test","test"
"test","," # Here we only have 1 comma
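The snippet that produced the outputs below is not included in this report, so here is a minimal sketch of how they could be reproduced. It assumes `batch_size=1`, that the `# ...` annotations above are not part of the files, and that the reader is driven through `next_batches(5)`. The helper names `write_fixtures` and `show_batches` are hypothetical.

```python
from pathlib import Path

# File contents as listed above, without the "# ..." annotations (assumption).
ERROR_CSV = 'a,b\n"test","test"\n"test","test"\n"test",",,"\n'
CORRECT_CSV = 'a,b\n"test","test"\n"test","test"\n"test",","\n'


def write_fixtures(directory: Path) -> None:
    """Write error.csv and correct.csv into the given directory."""
    (directory / "error.csv").write_text(ERROR_CSV)
    (directory / "correct.csv").write_text(CORRECT_CSV)


def show_batches(path: Path) -> None:
    """Print how many batches the batched reader returns, then the batches."""
    # Imported here so the fixtures can be written without Polars installed.
    import polars as pl

    # batch_size=1 is an assumption; the report only says the batch was small.
    reader = pl.read_csv_batched(path, batch_size=1)
    batches = reader.next_batches(5)  # a list of DataFrames, or None when exhausted
    print(len(batches) if batches is not None else 0)
    print(batches)
```

Calling `write_fixtures(Path("."))` followed by `show_batches(Path("error.csv"))` and `show_batches(Path("correct.csv"))` should print a count and a list of DataFrames for each file. The exact counts may depend on the installed Polars version.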
1
[shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
│ test ┆ test │
│ test ┆ ,,   │
└──────┴──────┘]
2
[shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘, shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘]
Issue description
I am trying to read a relatively large CSV (30M+ rows) that cannot fit into memory, so I am using read_csv_batched. However, I noticed that reader.next_batches(5), instead of returning the requested number of batches (5 in our case), always returns a single DataFrame containing all the rows, which is larger than the given batch size.
The issue seems to be caused by the , character. Since the field is enclosed in ", the comma should be treated as part of the value and should not affect the batch reader.
Note that this is a minimal example. In the real scenario we had batch_size = 100,000 and the whole CSV was still read into a single DataFrame of 30M rows.
(Posted on Stack Overflow first: https://stackoverflow.com/questions/78616907/polars-issue-with-read-csv-batched-when-separator-is-included-in-the-field)
Expected behavior
The expected behavior is the one shown for correct.csv, where 2 batches of size 1 are created.
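For comparison, a quote-aware parser treats the commas inside `",,"` as part of the field value, so the row boundaries (and therefore the batch boundaries) are unaffected. The following is a minimal sketch using Python's standard `csv` module, not Polars. The helper `iter_batches` is hypothetical and only illustrates the quoting behavior this report expects; it does not mirror how Polars sizes its batches.

```python
import csv
import io
from itertools import islice

# Contents of error.csv, without the "# ..." annotation (assumption).
ERROR_CSV = 'a,b\n"test","test"\n"test","test"\n"test",",,"\n'


def iter_batches(text, batch_size):
    """Yield lists of at most batch_size data rows, skipping the header."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(ERROR_CSV, batch_size=1))
print(len(batches))   # 3: one batch per data row
print(batches[-1])    # [['test', ',,']]: the quoted commas stay in one field
```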
Installed versions