EpicUsaMan opened 1 month ago
Without a reproducible example this isn't something we can fix, sadly. :/
Though we are making a whole new engine and removing the old streaming one, so it might get fixed implicitly.
It's actually that somewhere in streaming we are losing an error, which I described here, but I still can't prepare a minimal example. It produces the bug only on very large pipelines during streaming. Removing `.over()` on columns with only one distinct value fixes the problem.
So, I made another ticket: https://github.com/pola-rs/polars/issues/18600
I encountered the same error while trying to process a Parquet file with LazyFrame (not sure if it's related to OP's issue). It happens with and without streaming.
To download the test file to the current directory:

```shell
aws s3 cp s3://overturemaps-us-west-2/release/2024-07-22.0/theme=addresses/type=address/part-00000-a1dedcdb-edf7-42c4-aea4-87ddc4d97b65-c000.zstd.parquet ./ --no-sign-request
```
(I've trimmed full paths to just filenames)
```rust
let in_file_path = "part-00000-a1dedcdb-edf7-42c4-aea4-87ddc4d97b65-c000.zstd.parquet";

// Lazily scan the Parquet file.
let lf = LazyFrame::scan_parquet(in_file_path, ScanArgsParquet::default()).unwrap();

// Sort by multiple columns.
let lf = lf.sort(
    ["country", "postcode", "street", "number", "unit"],
    SortMultipleOptions::default(),
);

// Stream the sorted result back out as LZ4-compressed Parquet.
let out_file_path = "sorted.parquet";
let mut write_options = ParquetWriteOptions::default();
write_options.compression = ParquetCompression::Lz4Raw;
lf.with_streaming(true)
    .sink_parquet(out_file_path, write_options)
    .unwrap();
```
```
panicked at ...\polars-arrow-0.43.1\src\record_batch.rs:22:31:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("RecordBatch requires all its arrays to have an equal number of rows"))
```
OS: Windows
### Checks

### Reproducible example

I can't provide a reproducible example, because it happens only on really large frames (close to the memory limit of 256 GB).
### Log output

### Issue description

This error happens ONLY when running the query with `streaming=True`, and it started to appear after upgrading to polars 1.6.0.

My flow is: `scan_parquet` -> `filter` -> `sort` -> `cum_sum` / `ewm_mean` -> `group_by`.

(It happens almost immediately after starting the pipeline, so it looks like it's failing during scan/sort.)
### Expected behavior

Streaming should work exactly the same as the regular (non-streaming) pipeline.

### Installed versions