pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.86k stars 1.71k forks source link

Add symmetric_difference #6947

Open mcrumiller opened 1 year ago

mcrumiller commented 1 year ago

Problem description

It's currently clunky to find the difference or symmetric difference between two dataframes. The one-way (asymmetric) difference is pretty easy using an anti-join:

import polars as pl

df1 = pl.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, 2, 3, 4]
})
df2 = pl.DataFrame({
    'a': [3, 4, 6, 7],
    'b': [3, 4, 6, 7]
})

# returns items in A that aren't in B
df1.join(df2, on=['a', 'b'], how="anti")
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
└─────┴─────┘

# returns items in B that aren't in A
df2.join(df1, on=['a', 'b'], how="anti")
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 6   ┆ 6   │
│ 7   ┆ 7   │
└─────┴─────┘

so to the get the symmetric difference, we can combine these two, but it's not too pretty, and with a lot of duplicates I imagine it's not as performant as it could be:

# use dual anti-joins
df1.join(df2, on=['a', 'b'], how="anti").vstack(df2.join(df1, on=['a', 'b'], how="anti")).unique()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 6   ┆ 6   │
│ 7   ┆ 7   │
└─────┴─────┘

A simple df1.symmetric_difference(df2) would be nice. I also think a simple df1.difference(df2) as a shorthand for an anti-join would be nice as well.

But regardless, any suggestions for a better solution would be appreciated.

mcrumiller commented 1 year ago

I note another solution from this stackoverflow question using my above dataframes, modified a bit because that answer would fail if there are any duplicates in either frame, and the row count is unnecessary:

pl.concat((df1.unique(), df2.unique())).filter(pl.count().over(['a', 'b']) == 1)
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 6   ┆ 6   │
│ 7   ┆ 7   │
└─────┴─────┘

which may be more performant.

ghuls commented 1 year ago

Do an outer join followed by a filter.

df1.with_columns(
    pl.lit(True).alias("df1")
)
.join(
    df2.with_columns(
        pl.lit(True).alias("df2")
    ),
    on=['a', 'b'], 
    how="outer"
).filter(
    pl.col("df1") != pl.col("df2")
)

shape: (4, 4)
┌─────┬─────┬──────┬──────┐
│ a   ┆ b   ┆ df1  ┆ df2  │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╡
│ 6   ┆ 6   ┆ null ┆ true │
│ 7   ┆ 7   ┆ null ┆ true │
│ 1   ┆ 1   ┆ true ┆ null │
│ 2   ┆ 2   ┆ true ┆ null │
└─────┴─────┴──────┴──────┘
mcrumiller commented 1 year ago

@ghuls that sounds like a memory disaster waiting to happen.

ghuls commented 1 year ago

@ghuls that sounds like a memory disaster waiting to happen.

If you do it in lazy mode with streaming, it shouldn't be a problem:

https://github.com/pola-rs/polars/pull/5339