Open PhilippJunk opened 7 months ago
As the error states, we don't yet support running the full query on the streaming engine, so sink_csv raises that error.
This is expected behavior. In the future we might do the collect() -> write fallback ourselves.
We are working on supporting streaming for more queries; it is an ongoing process.
Thanks for clarifying. Is it also expected behavior that the third example runs without an error, but silently omits anything from the join?
It would be great if an error could be raised when trying to sink a query that's not fully supported instead of generating an incorrect result.
Unsurprisingly this affects sink_parquet too.
EDIT: This might be different as there's not even a join here, just a concat. Should this be its own issue? Or is it the same underlying problem?
Also ran across this problem, and worked up a minimal example. The streaming engine is probably the biggest draw of polars for me, so I'd really love to see this fixed.
Minimal example
import polars as pl
df1 = pl.LazyFrame({"Name": ["A"], "X": [1]})
df2 = pl.LazyFrame({"Name": ["B"], "X": [2]})
merged_df = pl.concat([df1, df2], how="align")
merged_df.sink_csv("sunk.csv")
merged_df.collect().write_csv("written.csv")
Contents of sunk.csv:
Name,X
B,2
Contents of written.csv:
Name,X
A,1
B,2
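The difference between the two files can also be checked mechanically. Below is a minimal sketch using only the Python standard library; the helper names `read_rows` and `rows_missing_from_sink` are hypothetical and not part of polars. It recreates the two files as reported above and lists the rows that written.csv has but sunk.csv lacks, ignoring row order.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def read_rows(path):
    """Return (header, multiset of data rows) for a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        return header, Counter(tuple(row) for row in reader)


def rows_missing_from_sink(sunk_path, written_path):
    """Rows found in the collect().write_csv() file but not in the sink_csv() file."""
    sunk_header, sunk_rows = read_rows(sunk_path)
    written_header, written_rows = read_rows(written_path)
    if sunk_header != written_header:
        raise ValueError(f"headers differ: {sunk_header} vs {written_header}")
    # Counter subtraction keeps only rows that occur more often in written_rows.
    return sorted((written_rows - sunk_rows).elements())


# Recreate the two files exactly as reported above, in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sunk = Path(tmp) / "sunk.csv"
    written = Path(tmp) / "written.csv"
    sunk.write_text("Name,X\nB,2\n")
    written.write_text("Name,X\nA,1\nB,2\n")
    missing = rows_missing_from_sink(sunk, written)

print(missing)
```

A check like this could run after a sink in a test suite, on data small enough to also collect, to catch silently dropped rows.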
Interestingly, this bug goes away if you omit how="align" from the concat. In that case, sunk.csv is identical to written.csv.
EDIT: This example doesn't take the same logical path. It's actually equivalent (I think) to my above example without how="align", so no surprise it works.
One more detail -- the Rust interface doesn't replicate the issue with the same logical setup:
use polars::prelude::*;

fn main() {
    let sink_df = make_concat_lazy_example();
    sink_df.sink_csv("out/sunk.csv", CsvWriterOptions::default());
    let write_df = make_concat_lazy_example();
    let mut write_file = std::fs::File::create("out/written.csv").unwrap();
    CsvWriter::new(&mut write_file).finish(&mut write_df.collect().unwrap());
}

fn make_concat_lazy_example() -> LazyFrame {
    let df1 = df![
        "Name" => ["A"],
        "X" => [1]
    ].unwrap().lazy();
    let df2 = df![
        "Name" => ["B"],
        "X" => [2]
    ].unwrap().lazy();
    let merged_df = concat(
        [df1, df2],
        UnionArgs::default()
    ).unwrap();
    merged_df
}
This produces identical CSVs containing both rows.
@sclamons align is a wrapper around full joins.
It looks like coalesce() is causing the issue on the streaming engine; without it, both rows are present:
(df1.join(df2, how="full", on=["Name", "X"], suffix="_PL_CONCAT_RIGHT")
    .with_columns(
        pl.coalesce([name, f"{name}_PL_CONCAT_RIGHT"])
        for name in ["Name", "X"]
    )
    .collect(streaming=True)
)
# shape: (1, 4)
# ┌──────┬─────┬──────────────────────┬───────────────────┐
# │ Name ┆ X   ┆ Name_PL_CONCAT_RIGHT ┆ X_PL_CONCAT_RIGHT │
# │ ---  ┆ --- ┆ ---                  ┆ ---               │
# │ str  ┆ i64 ┆ str                  ┆ i64               │
# ╞══════╪═════╪══════════════════════╪═══════════════════╡
# │ B    ┆ 2   ┆ B                    ┆ 2                 │
# └──────┴─────┴──────────────────────┴───────────────────┘
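For comparison, here is a plain-Python model of what a full join on both columns followed by a coalesce should yield for these two one-row frames. It is only a sketch of the intended semantics, not polars' implementation: `full_join` and `coalesce_keys` are hypothetical helpers, every column is assumed to be a join key, keys are assumed unique per side, and row order is not meant to match polars.

```python
SUFFIX = "_PL_CONCAT_RIGHT"
KEYS = ["Name", "X"]


def full_join(left, right, keys=KEYS, suffix=SUFFIX):
    """Toy full join for rows whose columns are all join keys (unique per side)."""
    left_keys = {tuple(row[k] for k in keys) for row in left}
    right_keys = {tuple(row[k] for k in keys) for row in right}
    joined = []
    for row in left:
        in_right = tuple(row[k] for k in keys) in right_keys
        # An unmatched left row gets nulls (None) in the right-hand columns.
        joined.append({**row, **{k + suffix: (row[k] if in_right else None) for k in keys}})
    for row in right:
        if tuple(row[k] for k in keys) not in left_keys:
            # An unmatched right row gets nulls in the left-hand columns.
            joined.append({**{k: None for k in keys}, **{k + suffix: row[k] for k in keys}})
    return joined


def coalesce_keys(rows, keys=KEYS, suffix=SUFFIX):
    """Fill each left-hand key from its right-hand counterpart when it is null."""
    return [
        {**row, **{k: row[k] if row[k] is not None else row[k + suffix] for k in keys}}
        for row in rows
    ]


df1 = [{"Name": "A", "X": 1}]
df2 = [{"Name": "B", "X": 2}]
result = coalesce_keys(full_join(df1, df2))
for row in result:
    print(row)
```

Under this model the result has two rows. The row shown in the streaming output above (B, 2, B, 2) matches the right-only row, so the one that goes missing would be the left-only row (A, 1, null, null).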
I think your Rust example is just doing a default vertical concat, so it's not equivalent to the Python repro?
@cmdlineluser Yes, you're right: the Rust example isn't taking the how="align" path, so no surprise it works.
Checks
Reproducible example
The same problem exists if the overlap is not empty:
I originally noticed this after an additional concat operation, which does not error, but silently omits some of the data:
Log output
Issue description
sink_csv does not behave as expected after join operations on LazyFrames. In some cases it errors. In other cases, it silently produces different results compared to collect().write_csv().
Expected behavior
df.sink_csv(file) and df.collect().write_csv(file) should lead to identical output.
Installed versions