pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.16k stars 1.94k forks source link

Add `df.supplement()` to add data from another dataframe that isn't already present #17509

Open matterhorn103 opened 3 months ago

matterhorn103 commented 3 months ago

Edit: This would be similar to df.update() (see comments 2 and 3), but df.supplement(df2) would a) prioritize the non-null results in df and only add the values from df2 that don't have corresponding values in df already (i.e. the opposite of update()) and b) if df2 has columns that aren't in df, those columns and values will be added to df as well.

Description

Hi,

The issue I was originally facing and two solutions to it can be found on StackExchange here.

In short, I wanted to know how to combine two separate DataFrames containing different data on an overlapping set of identifiers like:

import polars as pl

df = pl.DataFrame(
    {
        "i": [1, 2, 3, 4],
        "a": [2.6, 5.3, None, None],
        "b": ["ab", "cd", "ef", "gh"]
    }
)

df2 = pl.DataFrame(
    {
        "i": [2, 3, 5, 6],
        "a": [None, 3.5, 2.5, 0.9],
        "c": [True, False, False, True]
    }
)

into

┌─────┬──────┬──────┬───────┐
│ i   ┆ a    ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ str  ┆ bool  │
╞═════╪══════╪══════╪═══════╡
│ 2   ┆ 5.3  ┆ cd   ┆ true  │
│ 3   ┆ 3.5  ┆ ef   ┆ false │
│ 5   ┆ 2.5  ┆ null ┆ false │
│ 6   ┆ 0.9  ┆ null ┆ true  │
│ 1   ┆ 2.6  ┆ ab   ┆ null  │
│ 4   ┆ null ┆ gh   ┆ null  │
└─────┴──────┴──────┴───────┘

It seems like the DataFrame.group_by().agg() approach suggested at the linked question is the most idiomatic solution.

However, I originally assumed there was already some method I didn't know, because I don't think that my situation is necessarily all that uncommon. Would you consider making it possible to quickly and easily combine two DataFrames that have overlapping rows and columns?

My desire to do such a thing seemed to surprise some people on Discord but I think it's actually a fairly natural operation. I reckon that if most people were to be told to manually combine two sets of data like in my example, and were instructed which dataset should take priority, they would find it obvious what the result should be, and it would be nice to have an easy way to do this in polars.

Since it is essentially adding cells from the non-prioritized DataFrame to the prioritized DataFrame, it could be called DataFrame.supplement(), where what df.supplement(df2, on="i") would do is essentially add new empty rows to df for any values of "i" that are in df2 but not in df, likewise add any necessary columns, and then populate all cells that contain null with the corresponding values from df2.

I like the idea of a df.supplement() method, but if you would prefer a function perhaps it could be one called pl.combine() or pl.merge() or pl.fuse() or something. The operation wouldn't be restricted to two then, though there'd have to be some way to specify the priority. I guess via the order of the arguments, like for pl.concat().


Additionally – would it be possible to make the coalesce parameter of DataFrame.join() to be not a bool but a list of columns and all those specified get coalesced during the join, not just the column being joined on? With perhaps a wildcard "*" being allowed by analogy to pl.col("*")? This would save having to do a more verbose coalesce() operation on the suffixed columns after the join.


P.S. I could have sworn, to be honest, that either polars or pandas had a method that took a DataFrame post-join and automatically coalesced those columns that differed only by a _left and _right suffix, but I can't find anything of the sort.

cmdlineluser commented 3 months ago

It's sort of like .update() except it only gives you the intersection of columns, not the union.

>>> df.update(df2, on="i", how="full")
shape: (6, 3)
┌─────┬──────┬──────┐
│ i   ┆ a    ┆ b    │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ f64  ┆ str  │
╞═════╪══════╪══════╡
│ 2   ┆ 5.3  ┆ cd   │
│ 3   ┆ 3.5  ┆ ef   │
│ 5   ┆ 2.5  ┆ null │
│ 6   ┆ 0.9  ┆ null │
│ 4   ┆ null ┆ gh   │
│ 1   ┆ 2.6  ┆ ab   │
└─────┴──────┴──────┘

Perhaps adding the functionality to update would make sense?

matterhorn103 commented 3 months ago

Absolutely, it would be nice to allow update() to add columns too.

However, update() will also replace values from the original DataFrame with ones from the new one though, if they are not null. And I think adding an option to update() to instead prefer the original values would go against the intuitive notion of "updating" the values.

If I append the example with a 7th identifier to show the desired behaviour:

import polars as pl

df = pl.DataFrame(
    {
        "i": [1, 2, 3, 4, 7],
        "a": [2.6, 5.3, None, None, 7.0],
        "b": ["ab", "cd", "ef", "gh", "ij"]
    }
)

df2 = pl.DataFrame(
    {
        "i": [2, 3, 5, 6, 7],
        "a": [None, 3.5, 2.5, 0.9, 0.0],
        "c": [True, False, False, True, None]
    }
)

then the desired result with df.supplement(df2, on="i", how="full") would be:

┌─────┬──────┬──────┬───────┐
│ i   ┆ a    ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ str  ┆ bool  │
╞═════╪══════╪══════╪═══════╡
│ 2   ┆ 5.3  ┆ cd   ┆ true  │
│ 3   ┆ 3.5  ┆ ef   ┆ false │
│ 5   ┆ 2.5  ┆ null ┆ false │
│ 6   ┆ 0.9  ┆ null ┆ true  │
│ 1   ┆ 2.6  ┆ ab   ┆ null  │
│ 4   ┆ null ┆ gh   ┆ null  │
| 7   ┆ 7.0  ┆ ij   ┆ null  |
└─────┴──────┴──────┴───────┘

but with update() we get:

┌─────┬──────┬──────┐
│ i   ┆ a    ┆ b    │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ f64  ┆ str  │
╞═════╪══════╪══════╡
│ …   ┆ …    ┆ …    │
│ 7   ┆ 0.0  ┆ ij   │
│ …   ┆ …    ┆ …    │
└─────┴──────┴──────┘

So having a supplement() method would be a nice complement to update(). What I like about update() is that it is obvious what it does, and I think supplement() would be similarly clear.

I guess if the option to take the union of columns was added to update() then they would mirror each other (or whatever the technical term is) i.e. df.supplement(df2) == df2.update(df) (for full joins, ignoring the order of rows and columns). Indeed, if they were defined in that way, it would make understanding and using them trivial for users as the available options could match etc.

But the two operations are conceptually distinct and I think the different intentions would make it nice and intuitive to have both. So I think the existence of the update() method if anything strengthens the case for the suggestion – since we have one as syntactic sugar, why not both? :grin: