pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

High memory usage for `scan_csv()->head()` on compressed CSV file #18724

Open laserson opened 5 days ago

laserson commented 5 days ago


Reproducible example

Download this large CSV file from the OAS database:

https://opig.stats.ox.ac.uk/webapps/ngsdb/unpaired/Soto_2019/csv/SRR8365432_1_Heavy_IGHA.csv.gz

The gzipped file is about 2.7 GB.

Then run the following.

```python
import polars as pl

df = pl.scan_csv(
    "SRR8365432_1_Heavy_IGHA.csv.gz",
    has_header=True,
    skip_rows=1,
).head().collect()
```

My Python process starts using up all my RAM and then swapping. Sometimes the process gets killed by the OS. The same thing happens if I pass `low_memory=True` to `scan_csv` or `streaming=True` to `collect`.

It is very surprising to me that such a query would require all of my memory. Is this because, even in streaming mode, Polars materializes a big batch of rows in memory no matter what? Is it possible to control that? My goal is to do some light casting of columns followed by `sink_parquet` with very little memory usage.
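
For concreteness, the kind of pipeline I am after is roughly the sketch below; the column names, dtypes, and output path are placeholders, not the real OAS schema.

```python
import polars as pl

# Sketch of the intended low-memory pipeline: scan the compressed CSV lazily,
# lightly cast a couple of columns, and stream the result straight to Parquet.
# "sequence" and "redundancy" are placeholder column names.
(
    pl.scan_csv(
        "SRR8365432_1_Heavy_IGHA.csv.gz",
        has_header=True,
        skip_rows=1,
    )
    .with_columns(
        pl.col("sequence").cast(pl.Utf8),
        pl.col("redundancy").cast(pl.Int64),
    )
    .sink_parquet("SRR8365432_1_Heavy_IGHA.parquet")
)
```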

Log output

No response

Issue description

See above.

Expected behavior

Low memory usage when streaming a simple select => head or select => sink_parquet.

Installed versions

```
--------Version info---------
Polars:              1.7.0
Index type:          UInt32
Platform:            macOS-14.3.1-arm64-arm-64bit
Python:              3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec               2024.6.1
gevent
great_tables
matplotlib           3.8.3
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.1
pyarrow              17.0.0
pydantic             1.10.14
pyiceberg
sqlalchemy           2.0.28
torch                2.1.2.post3
xlsx2csv
xlsxwriter
```
laserson commented 5 days ago

And the same thing happens if I first execute

```python
pl.Config.set_streaming_chunk_size(10)
```
laserson commented 5 days ago

And finally, maybe a related issue: a call to `pl.read_csv_batched` also uses an enormous amount of memory before I even request any batches of data.
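
Roughly what I tried is sketched below; the memory blowup happens on the `read_csv_batched()` call itself, before `next_batches()` is ever called.

```python
import polars as pl

# Sketch of the batched-reader attempt: memory usage balloons while this
# constructor runs, before any batches have been requested.
reader = pl.read_csv_batched(
    "SRR8365432_1_Heavy_IGHA.csv.gz",
    has_header=True,
    skip_rows=1,
)

# Only now do we actually ask for data, a few batches at a time.
batches = reader.next_batches(5)
```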

aut0clave commented 5 days ago

Does the issue occur if you uncompress the file prior to `scan_csv()`? I wonder if it's trying to decompress the entire file into memory before applying the rest of the query plan.
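
Something like this sketch would decompress the file to disk with constant memory first, so any blowup afterwards would clearly be coming from the CSV scan itself (paths are just examples):

```python
import gzip
import shutil

import polars as pl

# Stream-decompress the gzip file to disk in fixed-size chunks so the
# decompression step itself uses only a constant amount of memory.
with gzip.open("SRR8365432_1_Heavy_IGHA.csv.gz", "rb") as src, \
        open("SRR8365432_1_Heavy_IGHA.csv", "wb") as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)

# Then repeat the original query against the uncompressed file.
df = pl.scan_csv(
    "SRR8365432_1_Heavy_IGHA.csv",
    has_header=True,
    skip_rows=1,
).head().collect()
print(df)
```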

laserson commented 5 days ago

Yes, I can confirm the same behavior if I first uncompress the file.

ritchie46 commented 4 days ago

We don't support streaming decompression yet.

> Yes, I can confirm the same behavior if I first uncompress the file.

I cannot reproduce this. Slice push-down works on non-streaming files and runs in a fraction of the time in my local run.

laserson commented 4 days ago

> We don't support streaming decompression yet.

Is there a workaround for this? Perhaps by giving `scan_csv` a file-like object that handles the decompression?

> I cannot reproduce this. Slice push-down works on non-streaming files and runs in a fraction of the time in my local run.

Could you share the command that you are running? For me, the query does in fact run on the decompressed data, but it still uses a surprisingly large amount of memory.

laserson commented 4 days ago

As an alternative, could you recommend a different way to convert the large gzipped CSV file into Parquet without needing a huge amount of memory?
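
For example, would something along these lines be a reasonable stopgap? This is just a sketch using the pyarrow I already have installed; the batch-wise CSV-to-Parquet conversion is my own assumption about how to keep memory bounded, not anything from the Polars docs.

```python
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Sketch: stream-decompress the gzip file, parse the CSV in record batches,
# and write each batch to Parquet as it arrives, so only one batch needs to
# be held in memory at a time.
with pa.OSFile("SRR8365432_1_Heavy_IGHA.csv.gz", "rb") as raw, \
        pa.CompressedInputStream(raw, "gzip") as gz:
    reader = pacsv.open_csv(
        gz,
        # Skip the leading metadata line, mirroring skip_rows=1 above.
        read_options=pacsv.ReadOptions(skip_rows=1),
    )
    with pq.ParquetWriter("SRR8365432_1_Heavy_IGHA.parquet", reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)
```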