Open mcrumiller opened 1 year ago
I note another solution from this stackoverflow question using my above dataframes, modified a bit because that answer would fail if there are any duplicates in either frame, and the row count is unnecessary:
pl.concat((df1.unique(), df2.unique())).filter(pl.count().over(['a', 'b']) == 1)
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 6 ┆ 6 │
│ 7 ┆ 7 │
└─────┴─────┘
which may be more performant.
Do an outer join followed by a filter.
df1.with_columns(
pl.lit(True).alias("df1")
)
.join(
df2.with_columns(
pl.lit(True).alias("df2")
),
on=['a', 'b'],
how="outer"
).filter(
pl.col("df1") != pl.col("df2")
)
shape: (4, 4)
┌─────┬─────┬──────┬──────┐
│ a ┆ b ┆ df1 ┆ df2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╡
│ 6 ┆ 6 ┆ null ┆ true │
│ 7 ┆ 7 ┆ null ┆ true │
│ 1 ┆ 1 ┆ true ┆ null │
│ 2 ┆ 2 ┆ true ┆ null │
└─────┴─────┴──────┴──────┘
@ghuls that sounds like a memory disaster waiting to happen.
@ghuls that sounds like a memory disaster waiting to happen.
If you do it in lazy mode with streaming, it shouldn't be a problem:
Problem description
It's currently clunky to find the difference or symmetric difference between two dataframes. The one-way (asymmetric) difference is pretty easy using an anti-join:
so to the get the symmetric difference, we can combine these two, but it's not too pretty, and with a lot of duplicates I imagine it's not as performant as it could be:
A simple
df1.symmetric_difference(df2)
would be nice. I also think a simpledf1.difference(df2)
as a shorthand for an anti-join would be nice as well.But regardless, any suggestions for a better solution would be appreciated.