Open etiennebacher opened 3 hours ago
I think the difference can also be seen without sinking.
import polars as pl
df1 = pl.LazyFrame({"a": ["foo"], "b": [1]})
df2 = pl.LazyFrame({"a": ["bar"], "b": [2]})
q = df1.join(df2, how="full", on=["a", "b"]).with_columns(c = 1)
q.collect()
# shape: (2, 5)
# ┌──────┬──────┬─────────┬─────────┬─────┐
# │ a    ┆ b    ┆ a_right ┆ b_right ┆ c   │
# │ ---  ┆ ---  ┆ ---     ┆ ---     ┆ --- │
# │ str  ┆ i64  ┆ str     ┆ i64     ┆ i32 │
# ╞══════╪══════╪═════════╪═════════╪═════╡
# │ null ┆ null ┆ bar     ┆ 2       ┆ 1   │
# │ foo  ┆ 1    ┆ null    ┆ null    ┆ 1   │
# └──────┴──────┴─────────┴─────────┴─────┘
q.collect(streaming=True)
# shape: (1, 5)
# ┌──────┬──────┬─────────┬─────────┬─────┐
# │ a    ┆ b    ┆ a_right ┆ b_right ┆ c   │
# │ ---  ┆ ---  ┆ ---     ┆ ---     ┆ --- │
# │ str  ┆ i64  ┆ str     ┆ i64     ┆ i32 │
# ╞══════╪══════╪═════════╪═════════╪═════╡
# │ null ┆ null ┆ bar     ┆ 2       ┆ 1   │
# └──────┴──────┴─────────┴─────────┴─────┘
Checks
Reproducible example
Log output
Issue description
In the example above, three LazyFrames are joined. If the resulting LazyFrame is written to a file via sink_csv() (or sink_parquet()), the result differs from first calling collect() and then write_csv(). Namely, when using sink_csv(), some rows are missing from the output.
If the last with_columns() call is removed, then using sink_csv() or collect() + write_csv() gives the same output.
Originally reported in https://github.com/pola-rs/r-polars/issues/1246 (cc @Columbus240)
Expected behavior
The number of rows should be the same when using sink_csv() or collect() + write_csv().
Installed versions