pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

When setting `low_memory=True` for `scan_csv`, if the input is large enough, the output is empty #16010

Closed: schmelczer closed this issue 3 days ago

schmelczer commented 2 weeks ago

Reproducible example

import polars as pl
from random import random

# Write a 10,000-row CSV with a single float column.
pl.DataFrame({
    "id": [random() for _ in range(10000)],
}).write_csv("a.csv")

# Scan it back lazily with low_memory=True and collect in streaming mode.
print(pl.scan_csv("a.csv", low_memory=True).collect(streaming=True))

The above prints an unexpectedly empty DataFrame:

shape: (0, 1)
┌─────┐
│ id  │
│ --- │
│ f64 │
╞═════╡
└─────┘

Log output

RUN STREAMING PIPELINE
[csv -> ordered_sink]
STREAMING CHUNK SIZE: 50000 rows

Issue description

I'd like to read a large CSV (with 10 small columns) using `scan_csv` in streaming mode. This script is meant to run in a resource-constrained environment, so I set `low_memory=True`; however, this results in no rows being read. The schema is still inferred correctly, but the returned DataFrame contains 0 rows. Setting `low_memory=False` solves the problem; see the quick check after this paragraph.
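As a quick check (a minimal sketch, not part of the original report, assuming the `a.csv` written by the reproducible example above), the row counts under both settings can be compared directly:

import polars as pl

# Minimal sketch: compare row counts under both low_memory settings.
# Assumes "a.csv" was written by the reproducible example above.
for low_memory in (True, False):
    n_rows = (
        pl.scan_csv("a.csv", low_memory=low_memory)
        .select(pl.len())  # count rows without materializing the data
        .collect(streaming=True)
        .item()
    )
    print(f"low_memory={low_memory}: {n_rows} rows")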

Expected behavior

I'd expect to get the same DataFrame regardless of whether `low_memory` is `True` or `False`.

Rerunning the above example with low_memory=False:

import polars as pl
from random import random

# Same setup as before: write a 10,000-row CSV with a single float column.
pl.DataFrame({
    "id": [random() for _ in range(10000)],
}).write_csv("a.csv")

# Identical scan, but with low_memory=False.
print(pl.scan_csv("a.csv", low_memory=False).collect(streaming=True))

produces the result we'd expect:

shape: (10_000, 1)
┌──────────┐
│ id       │
│ ---      │
│ f64      │
╞══════════╡
│ 0.22836  │
│ 0.291316 │
│ 0.250347 │
│ 0.22777  │
│ 0.773068 │
│ …        │
│ 0.530433 │
│ 0.63235  │
│ 0.601679 │
│ 0.749903 │
│ 0.479096 │
└──────────┘

Installed versions

```
--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.10.10 (v3.10.10:aad5f6a891, Feb 7 2023, 08:47:40) [Clang 13.0.0 (clang-1300.0.29.30)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2023.12.1
gevent:
hvplot:
matplotlib:           3.8.0
nest_asyncio:         1.5.7
numpy:                1.25.2
openpyxl:
pandas:               2.1.0
pyarrow:              14.0.1
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:           None
```
ritchie46 commented 4 days ago

@nameexhaustion I think we can remove the low-memory reader in polars-pipe. It seems to be wrong, and it is also slow, since we memmove the bytes multiple times. I think the mmap reader is already low-memory, as we memory-map the file.

WDYT?
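For context, a memory-mapped reader leans on the OS to page file bytes in on demand instead of copying them into application-owned buffers, which is why it is already effectively low-memory. A minimal Python sketch of the general idea (illustration only; the actual polars-pipe reader is written in Rust and differs in detail):

import mmap

# Illustration only: mmap lets the OS page file contents in lazily, so a reader
# can scan a large file without first copying all of its bytes into its own
# buffers. Assumes "a.csv" from the reproducible example above exists.
with open("a.csv", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header_end = mm.find(b"\n")      # locate the end of the header row
        print(mm[:header_end].decode())  # only the touched pages are faulted in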

nameexhaustion commented 4 days ago

I think we can remove it for now. Later we may revisit some of the approaches the low-memory reader uses, depending on how we read bytes from cloud storage for async.