Open matterhorn103 opened 3 months ago
It's sort of like .update()
except it only gives you the intersection of columns, not the union.
>>> df.update(df2, on="i", how="full")
shape: (6, 3)
┌─────┬──────┬──────┐
│ i ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪══════╪══════╡
│ 2 ┆ 5.3 ┆ cd │
│ 3 ┆ 3.5 ┆ ef │
│ 5 ┆ 2.5 ┆ null │
│ 6 ┆ 0.9 ┆ null │
│ 4 ┆ null ┆ gh │
│ 1 ┆ 2.6 ┆ ab │
└─────┴──────┴──────┘
Perhaps adding the functionality to update
would make sense?
Absolutely, it would be nice to allow update()
to add columns too.
However, update()
will also replace values from the original DataFrame with ones from the new one though, if they are not null
. And I think adding an option to update()
to instead prefer the original values would go against the intuitive notion of "updating" the values.
If I append the example with a 7th identifier to show the desired behaviour:
import polars as pl
df = pl.DataFrame(
{
"i": [1, 2, 3, 4, 7],
"a": [2.6, 5.3, None, None, 7.0],
"b": ["ab", "cd", "ef", "gh", "ij"]
}
)
df2 = pl.DataFrame(
{
"i": [2, 3, 5, 6, 7],
"a": [None, 3.5, 2.5, 0.9, 0.0],
"c": [True, False, False, True, None]
}
)
then the desired result with df.supplement(df2, on="i", how="full")
would be:
┌─────┬──────┬──────┬───────┐
│ i ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ bool │
╞═════╪══════╪══════╪═══════╡
│ 2 ┆ 5.3 ┆ cd ┆ true │
│ 3 ┆ 3.5 ┆ ef ┆ false │
│ 5 ┆ 2.5 ┆ null ┆ false │
│ 6 ┆ 0.9 ┆ null ┆ true │
│ 1 ┆ 2.6 ┆ ab ┆ null │
│ 4 ┆ null ┆ gh ┆ null │
| 7 ┆ 7.0 ┆ ij ┆ null |
└─────┴──────┴──────┴───────┘
but with update()
we get:
┌─────┬──────┬──────┐
│ i ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪══════╪══════╡
│ … ┆ … ┆ … │
│ 7 ┆ 0.0 ┆ ij │
│ … ┆ … ┆ … │
└─────┴──────┴──────┘
So having a supplement()
method would be a nice complement to update()
. What I like about update()
is that it is obvious what it does, and I think supplement()
would be similarly clear.
I guess if the option to take the union of columns was added to update()
then they would mirror each other (or whatever the technical term is) i.e. df.supplement(df2) == df2.update(df)
(for full joins, ignoring the order of rows and columns). Indeed, if they were defined in that way, it would make understanding and using them trivial for users as the available options could match etc.
But the two operations are conceptually distinct and I think the different intentions would make it nice and intuitive to have both. So I think the existence of the update()
method if anything strengthens the case for the suggestion – since we have one as syntactic sugar, why not both? :grin:
Edit: This would be similar to
df.update()
(see comments 2 and 3), butdf.supplement(df2)
would a) prioritize the non-null results indf
and only add the values fromdf2
that don't have corresponding values indf
already (i.e. the opposite ofupdate()
) and b) ifdf2
has columns that aren't indf
, those columns and values will be added todf
as well.Description
Hi,
The issue I was originally facing and two solutions to it can be found on StackExchange here.
In short, I wanted to know how to combine two separate DataFrames containing different data on an overlapping set of identifiers like:
into
It seems like the
DataFrame.group_by().agg()
approach suggested at the linked question is the most idiomatic solution.However, I originally assumed there was already some method I didn't know, because I don't think that my situation is necessarily all that uncommon. Would you consider making it possible to quickly and easily combine two DataFrames that have overlapping rows and columns?
My desire to do such a thing seemed to surprise some people on Discord but I think it's actually a fairly natural operation. I reckon that if most people were to be told to manually combine two sets of data like in my example, and were instructed which dataset should take priority, they would find it obvious what the result should be, and it would be nice to have an easy way to do this in polars.
Since it is essentially adding cells from the non-prioritized DataFrame to the prioritized DataFrame, it could be called
DataFrame.supplement()
, where whatdf.supplement(df2, on="i")
would do is essentially add new empty rows todf
for any values of"i"
that are indf2
but not indf
, likewise add any necessary columns, and then populate all cells that containnull
with the corresponding values fromdf2
.I like the idea of a
df.supplement()
method, but if you would prefer a function perhaps it could be one calledpl.combine()
orpl.merge()
orpl.fuse()
or something. The operation wouldn't be restricted to two then, though there'd have to be some way to specify the priority. I guess via the order of the arguments, like forpl.concat()
.Additionally – would it be possible to make the
coalesce
parameter ofDataFrame.join()
to be not a bool but a list of columns and all those specified get coalesced during the join, not just the column being joined on? With perhaps a wildcard"*"
being allowed by analogy topl.col("*")
? This would save having to do a more verbosecoalesce()
operation on the suffixed columns after the join.P.S. I could have sworn, to be honest, that either polars or pandas had a method that took a DataFrame post-join and automatically coalesced those columns that differed only by a
_left
and_right
suffix, but I can't find anything of the sort.