Open men1n2 opened 5 months ago
I have encountered a similar issue, with additional information to share.
When processing a 100 GB CSV file, the sink_csv
method OOMs. The annotated return type is DataFrame, and I suspect this is the issue: once sink_csv is called, the data should not need to be held in memory,
otherwise an OOM is guaranteed. I would like to find some hidden Rust Polars APIs for this task, if possible.
Another strange behavior: with certain parameter and (mostly) environment-variable settings the code does run, but the output file is significantly smaller than the input file. Increasing those numbers increases the output size and eventually leads to OOM; the relationship appears to be exponential.
Additionally, how UTF-8 decoding errors are handled is unclear. Are the offending lines silently skipped?
My code is shown below:
import os
os.environ["POLARS_MAX_THREADS"]="4"
os.environ["POLARS_STREAMING_CHUNK_SIZE"]="4"
import polars
input_csv = ...
output_csv = ...
df = polars.scan_csv(input_csv, ignore_errors=True, truncate_ragged_lines=True, low_memory=True)
df.unique(subset="company_id").sink_csv(output_csv, batch_size=10, include_header=True)
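One thing worth noting about the snippet above: unique() must remember every key it has seen before it can decide whether a row is a duplicate, so even a streaming sink has to hold state proportional to the number of distinct keys. As a point of comparison (not the Polars implementation, just a sketch using only the standard library, with the `company_id` key taken from the snippet and file paths as placeholders), a row-by-row deduplication that keeps only the keys in memory looks like this:

```python
import csv

def dedup_csv(src: str, dst: str, key: str = "company_id") -> None:
    """Stream src row by row, writing only the first row seen for each key.

    Only the set of seen keys is held in memory, never the rows themselves,
    so memory use grows with key cardinality, not with file size.
    Undecodable bytes are replaced rather than dropping whole lines.
    """
    seen = set()
    with open(src, newline="", encoding="utf-8", errors="replace") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[key] not in seen:
                seen.add(row[key])
                writer.writerow(row)
```

This is only meant to illustrate the lower bound on state: any exact deduplication, Polars included, needs at least the key set (or a hash table over it) resident in memory.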
Any updates on this issue?
Checks
Reproducible example
Log output
Issue description
I am using Polars with Rust to process a large dataset (larger than available memory) inside a Docker container. The container's memory is intentionally limited to 20 GB using
--memory=20gb
and --shm-size=20gb
. I am encountering an out-of-memory error while performing calculations on the dataset. Here's an overview of my workflow:
Despite using LazyFrame and enabling low_memory mode in ScanArgsParquet, I still encounter an out-of-memory error during execution of the code.
I have tried the following: setting
low_memory: true
in the scan_parquet function. The printed plan indicates that every operation should run in the streaming engine:
However, I am still running into memory issues when processing the large dataset (Parquet file size = 20GB).
Expected behavior
Polars should be able to unnest the fields and write the output to Parquet, even though the input file is larger than memory.
Installed versions