Open h-vetinari opened 6 years ago
Just about the name:
One disadvantage of this name: not being native speaking, and not coming from a database background, I would not even have the slightest idea what even the word itself would mean, let alone what it would do.
I know it is a term used in SQL (and which is a big plus for using such a name), and in the meantime I know it from there, but IMO it is a rather newcomer unfriendly name. In that sense, I find 'combine' or 'update' more natural terminology.
Then about the functionality: I think a coalesce
method would need to be rather like a combine_first
than as a update
, in terms of default overwrite
behaviour (which might not be what you want?), since the SQL coalesce is about the first non-missing value.
(sidenote in general: sorry for the slow progress at the moment in those discussions, it seems there are currently not many other core devs that have time to participate, and such discussions take time / need input of other people. That can certainly be annoying, but is part of the open source process ..)
@jorisvandenbossche
Thanks for the response! I understand that dev time is a very limited resource... ;-)
I don't feel strongly about the name. I'd be fine with having it under update
(which some people strongly oppose to being not inplace; hence this issue), and I'd be fine with having it as coalesce
(which has the SQL background; though you're right that the default for overwrite
should then be False).
I'd also be fine with naming it something shorter (and less specific than update/coalesce/combine_first
), like fuse
, which - in terms of language difficulty - should be IMO on par with melt/pivot/merge
, and would avoid clashing with people's pre-conceived notions about what capabilities the name implies (inplace/overwrite/join-style etc.). It's also nice and short to type.
https://www.dictionary.com/browse/fuse: [...] verb (used with object), fused, fus·ing.
The author skillfully fuses these
fragmentsDataFrames into a cohesive whole.
verb (used without object), fused, fus·ing. [...]
The two
groupsDataFrames fused to create one strong union.
In any case, my main point is about having the desired capabilities, not about a specific name.
Re-pinging @jreback @jorisvandenbossche @TomAugspurger @cpcloud @gfyoung @toobaz
@h-vetinari : Yikes! Sorry that we all went dark on this...
Given that you already had some backing already on this proposal from @jreback (and in some way from @jorisvandenbossche ), I would suggest that you try implementing this and open a PR. In terms of implementation, I'm inclined to agree that naming / behavior should be consistent with SQL.
If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.
@pandas-dev/pandas-core @h-vetinari : As a side note, I noticed that you have a handful of substantive PR's open that touch upon some pretty big functionality questions but seemed to have reached impasse's either as a result of core-dev's not answering or @h-vetinari your not updating them for awhile...
Perhaps we / you might want to clean those up first for merging before opening yet another PR? 😉
@gfyoung
If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.
My preference would be .fuse
, as outlined in https://github.com/pandas-dev/pandas/issues/22812#issuecomment-425079186
I noticed that you have a handful of substantive PR's open that touch upon some pretty big functionality questions but seemed to have reached impasse's either as a result of core-dev's not answering or @h-vetinari your not updating them for awhile... Perhaps we / you might want to clean those up first for merging before opening yet another PR?
I try to keep all my PRs current and have responded quickly to any and all feedback, but since that is such a scarce resource (and in my case, >95% of it is done by @jreback, it has to be acknowledged. So thanks for that! :) ), I have no problem doing several big PRs in parallel, which increases the total feedback throughput I receive and therefore lets me make faster progress.
In detail:
If we're changing the name, I'm wondering if it's worth creating yet-another-name versus using something somewhat standardized (COALESCE from SQL, for better or for worse)
@wesm Thanks for taking the time to answer.
The pre-existing name is a strong argument for sure. While I care much more about the functionality than what name it resides under, I also see the appeal of not having the baggage of preconceptions that come with the name (e.g. coalesce
does not have a notion of a join/overwrite in SQL land, or update
being inplace, etc.)
This could look as follows in a putative docstring:
def fuse(self, other, join='left', overwrite=True, filter_func=None,
errors='ignore'):
"""
Unite two DataFrames into one, filling non-NA values where possible.
This method always aligns on index and columns. The parameters easily
allow switching between different usage patterns, which sometimes have
dedicated methods in other languages (or previous versions of pandas):
* `update` (e.g. for python-dict):
same result as `df.fuse(other)`, but `fuse` is not inplace.
* `coalesce` (SQL):
equivalent (entry-by-entry) to `df.fuse(other, overwrite=False)`.
* `combine_first` (pandas before version 1.0):
`df.combine_first(other)` is equivalent to
`df.fuse(other, join='outer', overwrite=False) or
`other.fuse(df, join='outer')`.
Two additional keywords allow further tuning of the behavior, e.g. to
raise if non-NA values coincide.
Parameters
----------
other : DataFrame, or object coercible into a DataFrame
[rest is the same as for df.update (but with more joins)]
That being said, I'd be just as happy with coalesce
.
The name "fuse" doesn't really connote the "overlay" aspect of combine_first to me, but I don't have super strong feelings about it
It's really a pity that the .update
story in pandas is still a wasteland over two years later (e.g. Series.update
has no kwargs, and cannot be pipelined due to forcefully being in-place), much less providing something for coalesce or equivalent.
PS. I was prepared to implement all that along the lines of the docstring above (and do it right, going deep into the bowels of the type promotion to avoid painful inconsistencies for the user), but decided to spend my energies elsewhere after investing 100s of hours that ended up being repaid with what I can only describe as hostility (and - where my changes got taken over - basically zero attribution).
@h-vetinari Is this something you desired: https://pwwang.github.io/datar/notebooks/coalesce/
The state of
update/combine_first
inv0.23
:.update
signature does not match between DataFrame/Series (#22358)df.update
has ajoin
-kwarg that only supportsleft
, although the source code itself notes:# TODO: Support other joins
(#21855).update
is one of the (very) few pandas-methods that's inplace by default, but does not have aninplace
-kwarg (#22286).combine_first
is effectively (the not-yet-implemented).update(join='outer')
, has an awkward, non-standard name, and much fewer capabilities than.update
. (#21859)I tried to make some steps towards #21855 and #21859 by adding an
inplace
-kwarg todf.update
in #22286, which has been stalled in discussion whetherupdate
should ever be inplace at all, resp. how to move away from inplacing generally.Today, some headway was made with the comment by @jreback:
which I'm strongly in favour of (with the caveat that it should use the capabilities of
update
; I suggested something similar in #21855; would also solve most of the discussion there). And yes, dplyr uses "coalesce", which itself is inspired by SQL: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf#page.15This discussion is opened on the advice of @jreback, who would like to involve:
Also tagging the other participants of #21855: @gfyoung @toobaz
Summing up this proposal:
.coalesce
togeneric.py
, à la:def coalesce(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False):
which is not inplace and inherited by DataFrame/Seriesjoin='left'|'outer'|'inner'|'right'
(most of the discussion in #21855 is about potentially allowing different joins for different axes, and which keywords to use for that)..update
and.combine_first