ENH: coalesce-method (upgrade for update/combine_first)

h-vetinari commented 6 years ago

The state of update/combine_first in v0.23:

.update signature does not match between DataFrame/Series (#22358)
df.update has a join-kwarg that only supports left, although the source code itself notes: # TODO: Support other joins (#21855)
.update is one of the (very) few pandas-methods that's inplace by default, but does not have an inplace-kwarg (#22286)
.combine_first is effectively (the not-yet-implemented) .update(join='outer'), has an awkward, non-standard name, and much fewer capabilities than .update. (#21859)

I tried to make some steps towards #21855 and #21859 by adding an inplace-kwarg to df.update in #22286, which has been stalled in discussion whether update should ever be inplace at all, resp. how to move away from inplacing generally.

Today, some headway was made with the comment by @jreback:

So we have .update (in-place defaults) and .combine_first which is not very standard terminology. In an ideal world I think adding .coalesce is probably the right thing to do (does R use this term?). which is basically a rename of .combine_first, and deprecate .update.

which I'm strongly in favour of (with the caveat that it should use the capabilities of update; I suggested something similar in #21855; would also solve most of the discussion there). And yes, dplyr uses "coalesce", which itself is inspired by SQL: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf#page.15

This discussion is opened on the advice of @jreback, who would like to involve:

[...] to get some more commentary on this, esp from @jorisvandenbossche and @TomAugspurger (and some off-line discussions that I had with @cpcloud )

Also tagging the other participants of #21855: @gfyoung @toobaz

Summing up this proposal:

Add .coalesce to generic.py, à la: def coalesce(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False): which is not inplace and inherited by DataFrame/Series
support different joins, at least: join='left'|'outer'|'inner'|'right' (most of the discussion in #21855 is about potentially allowing different joins for different axes, and which keywords to use for that).
(potentially; not essential to the proposal) slowly deprecate .update and .combine_first

jorisvandenbossche commented 6 years ago

Just about the name:

One disadvantage of this name: not being native speaking, and not coming from a database background, I would not even have the slightest idea what even the word itself would mean, let alone what it would do.

I know it is a term used in SQL (and which is a big plus for using such a name), and in the meantime I know it from there, but IMO it is a rather newcomer unfriendly name. In that sense, I find 'combine' or 'update' more natural terminology.

Then about the functionality: I think a coalesce method would need to be rather like a combine_first than as a update, in terms of default overwrite behaviour (which might not be what you want?), since the SQL coalesce is about the first non-missing value.

(sidenote in general: sorry for the slow progress at the moment in those discussions, it seems there are currently not many other core devs that have time to participate, and such discussions take time / need input of other people. That can certainly be annoying, but is part of the open source process ..)

h-vetinari commented 6 years ago

@jorisvandenbossche

Thanks for the response! I understand that dev time is a very limited resource... ;-)

I don't feel strongly about the name. I'd be fine with having it under update (which some people strongly oppose to being not inplace; hence this issue), and I'd be fine with having it as coalesce (which has the SQL background; though you're right that the default for overwrite should then be False).

I'd also be fine with naming it something shorter (and less specific than update/coalesce/combine_first), like fuse, which - in terms of language difficulty - should be IMO on par with melt/pivot/merge, and would avoid clashing with people's pre-conceived notions about what capabilities the name implies (inplace/overwrite/join-style etc.). It's also nice and short to type.

https://www.dictionary.com/browse/fuse: [...] verb (used with object), fused, fus·ing.

to combine or blend by melting together; melt.
to unite or blend into a whole, as if by melting together:

The author skillfully fuses these ~~fragments~~ DataFrames into a cohesive whole.

verb (used without object), fused, fus·ing. [...]

to become united or blended:

The two ~~groups~~ DataFrames fused to create one strong union.

In any case, my main point is about having the desired capabilities, not about a specific name.

h-vetinari commented 5 years ago

Re-pinging @jreback @jorisvandenbossche @TomAugspurger @cpcloud @gfyoung @toobaz

gfyoung commented 5 years ago

@h-vetinari : Yikes! Sorry that we all went dark on this...

Given that you already had some backing already on this proposal from @jreback (and in some way from @jorisvandenbossche ), I would suggest that you try implementing this and open a PR. In terms of implementation, I'm inclined to agree that naming / behavior should be consistent with SQL.

If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.

gfyoung commented 5 years ago

@pandas-dev/pandas-core @h-vetinari : As a side note, I noticed that you have a handful of substantive PR's open that touch upon some pretty big functionality questions but seemed to have reached impasse's either as a result of core-dev's not answering or @h-vetinari your not updating them for awhile...

Perhaps we / you might want to clean those up first for merging before opening yet another PR? 😉

h-vetinari commented 5 years ago

@gfyoung

If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.

My preference would be .fuse, as outlined in https://github.com/pandas-dev/pandas/issues/22812#issuecomment-425079186

I noticed that you have a handful of substantive PR's open that touch upon some pretty big functionality questions but seemed to have reached impasse's either as a result of core-dev's not answering or @h-vetinari your not updating them for awhile... Perhaps we / you might want to clean those up first for merging before opening yet another PR?

I try to keep all my PRs current and have responded quickly to any and all feedback, but since that is such a scarce resource (and in my case, >95% of it is done by @jreback, it has to be acknowledged. So thanks for that! :) ), I have no problem doing several big PRs in parallel, which increases the total feedback throughput I receive and therefore lets me make faster progress.

In detail:

22225 is stalled in API discussion (see #24046)
22318 has been waiting for input from @jorisvandenbossche for 2 months now (haven't pushed that PR as not very important to me)
23167 is progressing (after some big precursors had to be dealt with)
23192 is stuck on a long chain of blocking PRs/issues: #23982 -> #23833 -> #23823 -> #23192
24050 is quite fresh
(I also closed #21645 as outdated since you looked)
and finally two other small PRs not worth mentioning

wesm commented 5 years ago

If we're changing the name, I'm wondering if it's worth creating yet-another-name versus using something somewhat standardized (COALESCE from SQL, for better or for worse)

h-vetinari commented 5 years ago

@wesm Thanks for taking the time to answer.

The pre-existing name is a strong argument for sure. While I care much more about the functionality than what name it resides under, I also see the appeal of not having the baggage of preconceptions that come with the name (e.g. coalesce does not have a notion of a join/overwrite in SQL land, or update being inplace, etc.)

This could look as follows in a putative docstring:

    def fuse(self, other, join='left', overwrite=True, filter_func=None,
             errors='ignore'):
        """
        Unite two DataFrames into one, filling non-NA values where possible.

        This method always aligns on index and columns. The parameters easily
        allow switching between different usage patterns, which sometimes have
        dedicated methods in other languages (or previous versions of pandas):

        * `update` (e.g. for python-dict):
            same result as `df.fuse(other)`, but `fuse` is not inplace.
        * `coalesce` (SQL):
            equivalent (entry-by-entry) to `df.fuse(other, overwrite=False)`.
        * `combine_first` (pandas before version 1.0):
            `df.combine_first(other)` is equivalent to
            `df.fuse(other, join='outer', overwrite=False) or
            `other.fuse(df, join='outer')`.

        Two additional keywords allow further tuning of the behavior, e.g. to
        raise if non-NA values coincide.

        Parameters
        ----------
        other : DataFrame, or object coercible into a DataFrame
            [rest is the same as for df.update (but with more joins)]

That being said, I'd be just as happy with coalesce.

wesm commented 5 years ago

The name "fuse" doesn't really connote the "overlay" aspect of combine_first to me, but I don't have super strong feelings about it

h-vetinari commented 3 years ago

It's really a pity that the .update story in pandas is still a wasteland over two years later (e.g. Series.update has no kwargs, and cannot be pipelined due to forcefully being in-place), much less providing something for coalesce or equivalent.

PS. I was prepared to implement all that along the lines of the docstring above (and do it right, going deep into the bowels of the type promotion to avoid painful inconsistencies for the user), but decided to spend my energies elsewhere after investing 100s of hours that ended up being repaid with what I can only describe as hostility (and - where my changes got taken over - basically zero attribution).

pwwang commented 2 years ago

@h-vetinari Is this something you desired: https://pwwang.github.io/datar/notebooks/coalesce/

pandas-dev / pandas

ENH: coalesce-method (upgrade for update/combine_first) #22812

22225 is stalled in API discussion (see #24046)

22318 has been waiting for input from @jorisvandenbossche for 2 months now (haven't pushed that PR as not very important to me)

23167 is progressing (after some big precursors had to be dealt with)

23192 is stuck on a long chain of blocking PRs/issues: #23982 -> #23833 -> #23823 -> #23192

24050 is quite fresh