[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Small file example - 1 batch instead of 10 batches
import polars as pl
# Create a DataFrame with values from 0 to 99
data = pl.DataFrame({"values": range(100)})
# Save the data to a single CSV file
data.write_csv("data.csv")
# Read the CSV file in batches using polars
batch_size = 10
csv_reader = pl.read_csv_batched("data.csv", has_header=True, batch_size=batch_size)
# Collect all batches in a loop
batches = []
while True:
    batch = csv_reader.next_batches(1)
    if not batch:
        break
    batches.extend(batch)
# Concatenate all batches into a single DataFrame
all_data = pl.concat(batches)
# Assert that the data read matches the original data
assert (
    all_data.shape[0] == 100
), "The number of rows read does not match the expected number of rows."
assert (
    all_data.shape[1] == 1
), "The number of columns read does not match the expected number of columns."
assert all_data["values"].to_list() == list(
    range(100)
), "The data read does not match the expected values."
# Assert the number of batches
assert (
    7 <= len(batches) <= 13
), f"The number of batches ({len(batches)}) read does not match the expected number of batches (7 .. 13)."
print("All assertions passed. Data read successfully and validated.")
# works with polars==1.12.0
# fails with polars==1.13.0 and 1.14.0, number of batches = 1
Big file example with 321 batches instead of 2 batches
import random
import string
import polars as pl
# Create a DataFrame with 50,000 lines
num_rows = 50000
values = range(num_rows)
strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=1000))
    for _ in range(num_rows)
]
data = pl.DataFrame({"values": values, "strings": strings})
# Save the data to a single CSV file
data.write_csv("data.csv")
# Read the CSV file in batches using polars
batch_size = 40000
csv_reader = pl.read_csv_batched("data.csv", has_header=True, batch_size=batch_size)
# Collect all batches in a loop
batches = []
while True:
    batch = csv_reader.next_batches(1)
    if not batch:
        break
    batches.extend(batch)
# Concatenate all batches into a single DataFrame
all_data = pl.concat(batches)
# Assert that the data read matches the original data
assert (
    all_data.shape[0] == num_rows
), "The number of rows read does not match the expected number of rows."
assert (
    all_data.shape[1] == 2
), "The number of columns read does not match the expected number of columns."
assert all_data["values"].to_list() == list(
    values
), "The 'values' column data read does not match the expected values."
assert (
    all_data["strings"].to_list() == strings
), "The 'strings' column data read does not match the expected values."
# Assert the number of batches
expected_batches = (num_rows + batch_size - 1) // batch_size  # ceiling division
assert (
    len(batches) == expected_batches
), f"The number of batches ({len(batches)}) read does not match the expected number of batches ({expected_batches})."
print("All assertions passed. Data read successfully and validated.")
# works with polars==1.12.0
# fails with polars==1.13.0 and 1.14.0, number of batches = 321
Log output
Traceback (most recent call last):
  File "C:\Users\mysers\docs\bug_report_polars_read_csv_batches_big.py", line 49, in <module>
    len(batches) == expected_batches
AssertionError: The number of batches (321) read does not match the expected number of batches (2).
Issue description
The function read_csv_batched() allows setting a batch_size parameter.
This parameter should be respected, at least approximately, when reading a CSV file with next_batches(1).
In version 1.12.0, this worked quite well.
Since version 1.13.0, it no longer does: in the examples above, we get 321 batches instead of 2 for the bigger file, and 1 batch instead of the expected 7 to 13 for the small one.
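For reference, the batch counts the asserts expect come from plain ceiling division; a minimal sketch (the helper name is illustrative, not a polars API):

```python
def expected_batch_count(num_rows: int, batch_size: int) -> int:
    """Ceiling division: minimum number of batches needed to cover num_rows."""
    return (num_rows + batch_size - 1) // batch_size

# Small example: 100 rows with batch_size=10 -> 10 batches
assert expected_batch_count(100, 10) == 10
# Big example: 50,000 rows with batch_size=40,000 -> 2 batches
assert expected_batch_count(50_000, 40_000) == 2
```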
Expected behavior
With version 1.12.0, the assert statements pass. They should pass again once the bug is fixed.
Installed versions