pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Reading a CSV file with the separator parameter set to a non-default value always loads the entire file into memory. #13655

Open karond-is-me opened 10 months ago

karond-is-me commented 10 months ago

Checks

Reproducible example

import polars as pl
pl.read_csv_batched("access - 副本 (2).log", batch_size=100).next_batches(1)  # OK: default separator
pl.read_csv_batched("access - 副本 (2).log", separator=' ', batch_size=100).next_batches(1)  # NG: whole file is loaded into memory
pl.scan_csv("access - 副本 (2).log").lazy().fetch(1)  # OK: default separator
pl.scan_csv("access - 副本 (2).log", separator=' ').lazy().fetch(1)  # NG: whole file is loaded into memory
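Because the reporter's log file is not attached, anyone trying to reproduce this needs a stand-in. The following is a minimal sketch that writes a synthetic space-separated file using only the standard library. The function name write_space_separated_log and the five-field row layout are assumptions made for illustration; they are not the format of the reporter's actual file.

```python
import random


def write_space_separated_log(path: str, rows: int, seed: int = 0) -> int:
    """Write `rows` synthetic space-separated records; return bytes written.

    The layout (ip, method, url, status, size) is hypothetical and only
    serves to produce a file that needs separator=' ' to parse.
    """
    rng = random.Random(seed)
    written = 0
    # newline="\n" disables newline translation so the count matches the file size.
    with open(path, "w", encoding="ascii", newline="\n") as fh:
        for i in range(rows):
            ip = ".".join(str(rng.randrange(256)) for _ in range(4))
            status = rng.choice((200, 404, 500))
            size = rng.randrange(10_000)
            written += fh.write(f"{ip} GET /item/{i} {status} {size}\n")
    return written
```

Raising rows until the file is several gigabytes gives a rough substitute for the 18GB input; whether it triggers the same behavior is not known.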

Log output

No response

Issue description

When reading a large (18GB) CSV file with the streaming or batched reading methods, setting the separator parameter to a non-default value appears to cause a memory explosion, even though this is not reflected in the Windows process manager. I also suspect that bug #9266 may be related to this issue. [Image]

Expected behavior

Streaming or batched reading should avoid loading the entire file into memory, whatever the separator parameter is set to.

Installed versions

Polars: 0.20.3
Index type: UInt32
Platform: Windows-10-10.0.18363-SP0
Python: 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 2.2.1
connectorx:
deltalake:
fsspec: 2023.4.0
gevent:
hvplot: 0.8.4
matplotlib: 3.7.2
numpy: 1.24.3
openpyxl: 3.0.10
pandas: 2.0.3
pyarrow: 11.0.0
pydantic: 1.10.8
pyiceberg:
pyxlsb:
sqlalchemy: 1.4.39
xlsx2csv:
xlsxwriter:

itamarst commented 9 months ago

I tried this on Linux, running /usr/bin/time -v python script.py to measure maximum resident memory, with the version on main as of Jan 17 2024. I was unable to see any difference in memory usage between runs with and without separator, albeit with a different file from the one the reporter used.
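A similar measurement can be scripted from Python so that both variants are compared in one run. This is a minimal sketch, assuming a Unix-like system (os.wait4 is not available on Windows) and Python 3.9 or later. The helper names peak_rss_kib and compare_separators are hypothetical; the Polars call is the one from the reproducible example. On Linux ru_maxrss is reported in KiB, while macOS reports bytes.

```python
import os
import subprocess
import sys


def peak_rss_kib(code: str) -> int:
    """Run `code` in a fresh Python process and return its peak RSS.

    The value is ru_maxrss for that child only (KiB on Linux).
    """
    proc = subprocess.Popen([sys.executable, "-c", code])
    # wait4 reaps this specific child and returns its resource usage.
    _, status, usage = os.wait4(proc.pid, 0)
    proc.returncode = os.waitstatus_to_exitcode(status)
    if proc.returncode != 0:
        raise RuntimeError(f"child exited with status {proc.returncode}")
    return usage.ru_maxrss


def compare_separators(path: str) -> dict[str, int]:
    """Peak RSS of the batched reader with and without separator=' '."""
    base = (
        "import polars as pl\n"
        "pl.read_csv_batched({!r}, batch_size=100{}).next_batches(1)\n"
    )
    return {
        "default": peak_rss_kib(base.format(path, "")),
        "space": peak_rss_kib(base.format(path, ", separator=' '")),
    }
```

Calling compare_separators with the path of a large test file returns the two peaks side by side. Each variant runs in its own process, so one run's high-water mark does not mask the other's.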

karond-is-me commented 9 months ago

After updating Polars to version 0.20.5, I saw no discernible change on my Windows computer. [Image]