pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

read_csv_batched() does not respect batch sizes #19978

Open MarcusJellinghaus opened 3 days ago

MarcusJellinghaus commented 3 days ago

Checks

Reproducible example

Small file example with 1 batch instead of 10 batches

import polars as pl

# Create a DataFrame with values from 0 to 99
data = pl.DataFrame({"values": range(100)})

# Save the data to a single CSV file
data.write_csv("data.csv")

# Read the CSV file in batches using polars
batch_size = 10
csv_reader = pl.read_csv_batched("data.csv", has_header=True, batch_size=batch_size)

# Collect all batches in a loop
batches = []
while True:
    batch = csv_reader.next_batches(1)
    if not batch:
        break
    batches.extend(batch)

# Concatenate all batches into a single DataFrame
all_data = pl.concat(batches)

# Assert that the data read matches the original data
assert (
    all_data.shape[0] == 100
), "The number of rows read does not match the expected number of rows."
assert (
    all_data.shape[1] == 1
), "The number of columns read does not match the expected number of columns."
assert all_data["values"].to_list() == list(
    range(100)
), "The data read does not match the expected values."

# Assert the number of batches
assert (
    len(batches) >= 7 and len(batches) <= 13
), f"The number of batches ({len(batches)}) read does not match the expected number of batches (7 .. 13)."

print("All assertions passed. Data read successfully and validated.")
# works with polars==1.12.0
# fails with polars==1.13.0 and 1.14.0, number of batches = 1

Big file example with 321 batches instead of 2 batches

import random
import string

import polars as pl

# Create a DataFrame with 50,000 lines
num_rows = 50000
values = range(num_rows)
strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=1000)) for _ in range(num_rows)
]
data = pl.DataFrame({"values": values, "strings": strings})

# Save the data to a single CSV file
data.write_csv("data.csv")

# Read the CSV file in batches using polars
batch_size = 40000
csv_reader = pl.read_csv_batched("data.csv", has_header=True, batch_size=batch_size)

# Collect all batches in a loop
batches = []
while True:
    batch = csv_reader.next_batches(1)
    if not batch:
        break
    batches.extend(batch)

# Concatenate all batches into a single DataFrame
all_data = pl.concat(batches)

# Assert that the data read matches the original data
assert (
    all_data.shape[0] == num_rows
), "The number of rows read does not match the expected number of rows."
assert (
    all_data.shape[1] == 2
), "The number of columns read does not match the expected number of columns."
assert all_data["values"].to_list() == list(
    values
), "The 'values' column data read does not match the expected values."
assert (
    all_data["strings"].to_list() == strings
), "The 'strings' column data read does not match the expected values."

# Assert the number of batches
expected_batches = (num_rows + batch_size - 1) // batch_size  # Calculate expected number of batches
assert (
    len(batches) == expected_batches
), f"The number of batches ({len(batches)}) read does not match the expected number of batches ({expected_batches})."

print("All assertions passed. Data read successfully and validated.")
# works with polars==1.12.0
# fails with polars==1.13.0 and 1.14.0, number of batches = 321

Log output

Traceback (most recent call last):
  File "C:\Users\mysers\docs\bug_report_polars_read_csv_batches_big.py", line 49, in <module>
    len(batches) == expected_batches
AssertionError: The number of batches (321) read does not match the expected number of batches (2).

Issue description

The function read_csv_batched() allows setting the batch_size parameter.

This parameter should be respected, at least approximately, when reading a CSV file with next_batches(1). In version 1.12.0 this worked quite well; since version 1.13.0 it no longer does. In the examples above, we get 321 batches instead of 2 for the bigger file, and 1 batch instead of roughly 10 for the smaller file.
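
As a stopgap, one can re-slice whatever next_batches() yields into frames of the requested size. The sketch below is a hypothetical helper (rechunked_batches is not part of the Polars API); it relies only on the public pl.concat, DataFrame.slice, and DataFrame.height:

```python
import polars as pl

def rechunked_batches(reader, batch_size: int):
    """Hypothetical helper: yield DataFrames of roughly batch_size rows
    from a batched CSV reader, regardless of how the reader itself
    chunks the file."""
    buffer = None
    while True:
        batches = reader.next_batches(1)
        if not batches:
            break
        for batch in batches:
            # Accumulate rows until a full batch can be emitted.
            buffer = batch if buffer is None else pl.concat([buffer, batch])
            while buffer.height >= batch_size:
                yield buffer.slice(0, batch_size)
                buffer = buffer.slice(batch_size)
    # Flush any remaining rows as a final, smaller batch.
    if buffer is not None and buffer.height > 0:
        yield buffer

# Usage with the big-file example above:
# reader = pl.read_csv_batched("data.csv", has_header=True, batch_size=40000)
# batches = list(rechunked_batches(reader, 40000))  # 2 batches for 50,000 rows
```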

Expected behavior

With polars==1.12.0 the assert statements above pass. After the bug is fixed, they should pass again: the batches returned by next_batches() should roughly match the requested batch_size.
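
For reference, the expected batch count in the big-file example is plain ceiling division, so a quick sanity check looks like this:

```python
# Ceiling division, the same expected_batches formula used in the repro:
num_rows, batch_size = 50_000, 40_000
expected_batches = (num_rows + batch_size - 1) // batch_size
assert expected_batches == 2  # ceil(50000 / 40000) == 2
```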

Installed versions

```
--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            Windows-10-10.0.22621-SP0
Python:              3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
boto3
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
google.auth
great_tables
matplotlib
nest_asyncio         1.6.0
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
pyarrow              18.0.0
pydantic
pyiceberg
sqlalchemy           2.0.36
torch
xlsx2csv
xlsxwriter           3.2.0
```
ritchie46 commented 3 days ago

I think we should default this to None, meaning we decide, and if it is set, (try to) respect it.
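
Read that way, the resolution logic might look like the following sketch; resolve_batch_size is a hypothetical illustration, not Polars source code:

```python
from typing import Optional

def resolve_batch_size(requested: Optional[int], engine_default: int) -> int:
    """Hypothetical illustration of the proposed semantics, not Polars
    source code: None lets the engine pick a chunk size, while an
    explicit value is respected on a best-effort basis."""
    if requested is None:
        return engine_default  # "let us decide"
    return max(1, requested)   # "(try to) respect it"
```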