pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

When setting `low_memory=True` for `scan_csv`, if the input is large enough, the output is empty #16010

Closed: schmelczer closed this issue 3 days ago

schmelczer commented 2 weeks ago

Reproducible example

import polars as pl
from random import random

# Write a 10,000-row CSV with a single float column.
pl.DataFrame({
    "id": [random() for _ in range(10000)],
}).write_csv("a.csv")

# Scan it back lazily with low_memory=True and collect in streaming mode.
print(pl.scan_csv("a.csv", low_memory=True).collect(streaming=True))

The above prints an unexpectedly empty DataFrame:

shape: (0, 1)
┌─────┐
│ id  │
│ --- │
│ f64 │
╞═════╡
└─────┘

Log output

RUN STREAMING PIPELINE
[csv -> ordered_sink]
STREAMING CHUNK SIZE: 50000 rows

Issue description

I'd like to read a large CSV (with 10 small columns) using `scan_csv` in streaming mode. This script is meant to run in a resource-constrained environment, so I set `low_memory=True`; however, this results in no rows being read. The schema is still inferred correctly, but the returned DataFrame contains 0 rows. Setting `low_memory=False` solves the problem; see the quick check after this paragraph.
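As a quick check (a minimal sketch, not part of the original report, assuming the `a.csv` written by the reproducible example above), the row counts under both settings can be compared directly:

import polars as pl

# Minimal sketch: compare row counts under both low_memory settings.
# Assumes "a.csv" was written by the reproducible example above.
for low_memory in (True, False):
    n_rows = (
        pl.scan_csv("a.csv", low_memory=low_memory)
        .select(pl.len())  # count rows without materializing the data
        .collect(streaming=True)
        .item()
    )
    print(f"low_memory={low_memory}: {n_rows} rows")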

Expected behavior

I'd expect to get the same DataFrame regardless of whether `low_memory` is `True` or `False`.

Rerunning the above example with low_memory=False:

import polars as pl
from random import random

# Same setup as before: write a 10,000-row CSV with a single float column.
pl.DataFrame({
    "id": [random() for _ in range(10000)],
}).write_csv("a.csv")

# Identical scan, but with low_memory=False.
print(pl.scan_csv("a.csv", low_memory=False).collect(streaming=True))

produces the result we'd expect:

shape: (10_000, 1)
┌──────────┐
│ id       │
│ ---      │
│ f64      │
╞══════════╡
│ 0.22836  │
│ 0.291316 │
│ 0.250347 │
│ 0.22777  │
│ 0.773068 │
│ …        │
│ 0.530433 │
│ 0.63235  │
│ 0.601679 │
│ 0.749903 │
│ 0.479096 │
└──────────┘

Installed versions

```
--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.10.10 (v3.10.10:aad5f6a891, Feb 7 2023, 08:47:40) [Clang 13.0.0 (clang-1300.0.29.30)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2023.12.1
gevent:
hvplot:
matplotlib:           3.8.0
nest_asyncio:         1.5.7
numpy:                1.25.2
openpyxl:
pandas:               2.1.0
pyarrow:              14.0.1
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:           None
```
ritchie46 commented 4 days ago

@nameexhaustion I think we can remove the low-memory reader in polars-pipe. It seems to be wrong, and it is also slow, since we memmove the bytes multiple times. I think the mmap reader is already low-memory, as we memory-map the file.

WDYT?
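For context, a memory-mapped reader leans on the OS to page file bytes in on demand instead of copying them into application-owned buffers, which is why it is already effectively low-memory. A minimal Python sketch of the general idea (illustration only; the actual polars-pipe reader is written in Rust and differs in detail):

import mmap

# Illustration only: mmap lets the OS page file contents in lazily, so a reader
# can scan a large file without first copying all of its bytes into its own
# buffers. Assumes "a.csv" from the reproducible example above exists.
with open("a.csv", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header_end = mm.find(b"\n")      # locate the end of the header row
        print(mm[:header_end].decode())  # only the touched pages are faulted in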

nameexhaustion commented 4 days ago

I think we can remove it for now. Later we may revisit some of the approaches the low-memory reader uses, depending on how we read bytes from cloud storage for async.