Open laserson opened 5 days ago
And the same thing happens if I first execute pl.Config.set_streaming_chunk_size(10).
Finally, and maybe a related issue: a call to pl.read_csv_batched also uses an enormous amount of memory before I even request any batches of data.
Does the issue occur if you uncompress the file prior to scan_csv()? I wonder if it's trying to uncompress the entire file into memory before applying the rest of the query plan.
Yes, I can confirm the same behavior if I first uncompress the file.
We don't support streaming decompression yet.
> Yes, I can confirm the same behavior if I first uncompress the file.
I cannot reproduce this. Slice push-down works on non-streaming files and runs in a fraction of the time in my local run.
> We don't support streaming decompression yet.
Is there a workaround for this? Perhaps by giving scan_csv a file-like object that handles the decompression?
> I cannot reproduce this. Slice push-down works on non-streaming files and runs in a fraction of the time in my local run.
Could you share the command that you are running? For me, the query does in fact run on the decompressed data, but it still uses a surprisingly large amount of memory.
As an alternative, could you recommend a separate way to convert the large gzipped csv file into Parquet without needing a huge amount of memory?
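One possible direction for the question above, offered only as a rough sketch and not as a Polars feature: Python's standard gzip module decompresses incrementally, so the file can be read in fixed-size row batches without holding it all in memory. The helper name iter_csv_batches and the default batch size are hypothetical, and the sketch assumes the first line of the file is the CSV header.

```python
from __future__ import annotations

import csv
import gzip
from itertools import islice
from typing import Iterator


def iter_csv_batches(
    path: str, batch_size: int = 50_000
) -> Iterator[tuple[list[str], list[list[str]]]]:
    """Yield (header, rows) batches from a gzipped CSV file.

    Hypothetical helper: only one batch of rows is held in memory at a time.
    Assumes the first line of the file is the header row.
    """
    # gzip.open in text mode decompresses lazily as the reader pulls lines.
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return  # empty file: nothing to yield
        while True:
            # Pull at most batch_size rows from the underlying stream.
            rows = list(islice(reader, batch_size))
            if not rows:
                break
            yield header, rows
```

Each batch could then be converted and appended to a single Parquet file by a writer that supports incremental writes (for example pyarrow's ParquetWriter); that step is left out here because it is untested against this particular file.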
Checks
Reproducible example
Download this large CSV file from the OAS database:
https://opig.stats.ox.ac.uk/webapps/ngsdb/unpaired/Soto_2019/csv/SRR8365432_1_Heavy_IGHA.csv.gz
The gzipped file is about 2.7 GB.
Then run the following.
My Python process starts using up all my RAM and then swapping. Sometimes the process gets killed by the OS. The same thing happens if I use low_memory=True in scan_csv or streaming=True in collect.
I find it very surprising that such a query would require all of my memory. Is this because, even in streaming mode, Polars will materialize a big batch of rows in memory no matter what? Is it possible to control that? My goal is to do some light casting of columns followed by sink_parquet with very little memory usage.
Log output
No response
Issue description
See above.
Expected behavior
Low-memory usage for streaming a simple select => head or select => sink_parquet.
Installed versions