pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.77k stars 17.97k forks source link

Please add force_suffixes to pandas.merge() #17834

Open sorenwacker opened 7 years ago

sorenwacker commented 7 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd

A = pd.DataFrame({'colname': [1, 2]})

B = pd.DataFrame({'colname': [1, 2]})

C = pd.DataFrame({'colname': [1, 2]})

D = pd.merge(A, B, right_index=True, left_index=True, suffixes=('_A', '_B'))

print(pd.merge(D, B, right_index=True, left_index=True, suffixes=('', '_C'))

>   colname_A  colname_B  colname
> 0          1          1        1
> 1          2          2        2

Problem description

When using pandas.merge() suffixes are only added to the column-names if a column-name appears in both data frames. With more than two data frames that can lead to the above situation. The first merge adds suffixes. When the resulting table is merged with a third table the column-name does not appear twice and the suffix is not added. The suffix then needs to be added manually. This is overhead, especially when larger numbers of data-frames are merged. An option "force_suffixes" would be appreciated that ensures the suffix is added.

Expected Output

>   colname_A  colname_B  colname_C
> 0          1          1        1
> 1          2          2        2
gfyoung commented 7 years ago

@soerendip : Thanks for reporting this! Could you explain why you would want the suffix to be forcefully added, even if duplicates don't exist? I'm having some difficulty understanding the use-case.

sorenwacker commented 7 years ago

@gfyoung: Say you have several tables with the identical column names. You want to combine them into one table while keeping track of the source. As in my example above, the third table did not get the label. You could use the f.add_suffix('_some_suffix'), I suppose, and merge the table afterwards.

ash13 commented 7 years ago

Hi, I want to work on this, can anyone help me? I am a beginner.

jreback commented 7 years ago

@ash13 I would try a different issue; this request has not gotten much commentary; i am -0 on it nowhere

ThomasProctor commented 7 years ago

This isn't coming from nowhere - there's a stackoverflow question on it. As of October 31, 2017, it has 10 up-votes.

gfyoung commented 7 years ago

@ThomasProctor : Thanks for pointing that out. I'm still not convinced that we should do this. You can easily call df.rename to rename any confusing columns.

ThomasProctor commented 7 years ago

Here's the kludge I've been using to do this. Maybe it will help with a fix. I make no claims to it being remotely well-engineered

def merge_force_suffix(left, right, **kwargs):
    on_col = kwargs['on']
    suffix_tupple = kwargs['suffixes']

    def suffix_col(col, suffix):
        if col != on_col:
            return str(col) + suffix
        else:
            return col

    left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
    right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
    del kwargs['suffixes']
    return pd.merge(left_suffixed, right_suffixed, **kwargs)
gfyoung commented 7 years ago

Maybe it will help with a fix.

@ThomasProctor : Thanks for sharing this. I think this would be a reasonable workaround for people to use if they so need to.

ThomasProctor commented 7 years ago

@gfyoung My use case right now is merging together 10 or so DataFrames, all with possible redundant column names. Trying to keep track of where columns came from is just annoying.

joshdempster commented 6 years ago

@gfyoung : +1 for this feature. Like others, I find having to rename dataframes to get suffixes annoying and unnatural. Either you have to modify in place (never the behavior I want), or you have to duplicate the frames. I'd actually argue that the default behavior of join should be to always append suffixes if they are supplied, but at least having the option would make merging multiple frames less unpleasant.

asavinov commented 5 years ago

This feature (enforcing suffixes) is definitely needed for all non-trivial merges in order to avoid ambiguity: currently I never know what column names I will get in the result frame. Ideally, it should be a default option: if I provide a suffix then it must be used.

TheRealPJC commented 5 years ago

I agree - this very useful.

elpablete commented 5 years ago

Having a df 'links' with source and destination columns. I need to add details about the source and the destination from df 'cities'. Easy: juast merge links and cities on the source and city_id and then merge the result again with cities, only this time on destination and city_id. What do I get? A little mess because the details of the source city are NOT marked with the '_s' suffix after the first merge, but the second merge ends with the '_d' suffix on all the columns from df cities.

links = merge(links, cities, left_on='source', right_on='city_id', suffixes=('', '_s'))
links = merge(links, cities, left_on='destination', right_on='city_id', suffixes=('', '_d'))

I don't really want to rename cities prior to the first merge and then again prior to the second. And I definitely do not want cities to have its column names altered after I do the double merge.

TomAugspurger commented 5 years ago

The original request seems reasonable. I'm not sure about the name force_suffixes, but I haven't come up with anything better.

On Mon, May 27, 2019 at 9:29 PM Juan Pablo Tamayo notifications@github.com wrote:

Having a df 'links' with source and destination columns. I need to add details about the source and the destination from df 'cities'. Easy: juast merge links and cities on the source and city_id and then merge the result again with cities, only this time on destination and city_id. What do I get? A little mess because the details of the source city are NOT marked with the '_s' suffix after the first merge, but the second merge ends with the '_d' suffix on all the columns from df cities.

links = merge(links, cities, left_on='source', right_on='city_id', suffixes=('', '_s')) links = merge(links, cities, left_on='destination', right_on='city_id', suffixes=('', '_d'))

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/17834?email_source=notifications&email_token=AAKAOIWN3KO7GN5FU2K5C5DPXSKH3A5CNFSM4D6OMT6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWKZLNI#issuecomment-496342453, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIS3UKTFJNBIJNKSG73PXSKH3ANCNFSM4D6OMT6A .

lia-simeone commented 5 years ago

add_suffix is a palatable alternative to building this feature into merge I think. I agree it doesn't entirely address a use case with merging 10+ dataframes as 10+ calls to add_suffix isn't great. But it's workable and certainly better than alternatives I used in the past.

Binary-yev commented 4 years ago

yeah this feature would be great . say I have one total dataframe, I break it out by column value and then merge to another. every time i do that in a for loop would be great to add that column value I selected and add it to new columnd name. say product X mean, product Y mean, product Z mean

AntoineLamer commented 4 years ago

Hi !

Here is my use case : I'm working on record linkage. I have a reference dataset with fixed fields. (1) The user load its own dataset. (2) I perform record linkage operation. (3) At the end, I need to export the results and select the relevant fields.

As I don't know which fields the user will load, and will be duplicate with my reference dataset, I'd like to add suffixes for the two dataframes while merging. This would fix the problem of correctly selecting the field as I don't know in advance which one will be duplicate.

As reported above, I could suffixe all fields earlier, when loading the dataframes, but seems less convenient.

juandpinto commented 4 years ago

Any update on this feature? It's been a while since it was first suggested, but as others have pointed out, there are plenty of good use cases for it. Could have certainly used it on my current project with multiple merges!

jreback commented 4 years ago

pandas is all volunteer

you are welcome to submit a PR

joao-d-oliveira commented 3 years ago

Here's the kludge I've been using to do this. Maybe it will help with a fix. I make no claims to it being remotely well-engineered

def merge_force_suffix(left, right, **kwargs):
    on_col = kwargs['on']
    suffix_tupple = kwargs['suffixes']

    def suffix_col(col, suffix):
        if col != on_col:
            return str(col) + suffix
        else:
            return col

    left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
    right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
    del kwargs['suffixes']
    return pd.merge(left_suffixed, right_suffixed, **kwargs)

Just a small remark... it's a nice implementation, thanks. However it only works for single column "on". If you have multiple keys, it won't work.

You could make a small adaption: if col not in on_col: instead of: if col != on_col:

therealbct commented 2 years ago

Here's the kludge I've been using to do this. Maybe it will help with a fix. I make no claims to it being remotely well-engineered

def merge_force_suffix(left, right, **kwargs):
    on_col = kwargs['on']
    suffix_tupple = kwargs['suffixes']

    def suffix_col(col, suffix):
        if col != on_col:
            return str(col) + suffix
        else:
            return col

    left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
    right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
    del kwargs['suffixes']
    return pd.merge(left_suffixed, right_suffixed, **kwargs)

Just a small remark... it's a nice implementation, thanks. However it only works for single column "on". If you have multiple keys, it won't work.

You could make a small adaption: if col not in on_col: instead of: if col != on_col:

If you only replace != on_col with not in on_col, you'll encounter issues when merging on a single string name column which is inclusive of the name of other columns (e.g. using argument on='ab', while another column is named 'b'). If this happens, you'll end up with the default behavior of merge (e.g. do nothing, or add an _x or _y when columns are duplicates instead of enforcing the dedicated suffixes on those columns for which the name is included in the "on" column), therefore defeating the attempt of this workaround to circumvent the default behavior.

Updated version that works with both single and multi-column "on":

def merge_force_suffix(left, right, **kwargs):
    on_col = kwargs['on']
    on_col = [on_col] if not isinstance(on_col, (list, tuple)) else on_col
    suffix_tupple = kwargs['suffixes']

    def suffix_col(col, suffix):
        if col not in on_col:
            return str(col) + suffix
        else:
            return col

    left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
    right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
    del kwargs['suffixes']
    return pd.merge(left_suffixed, right_suffixed, **kwargs)
terryaa commented 9 months ago

I don't believe this is still open. need this feature so bad right now.. :(

ExarKun64 commented 1 month ago

Hi, I've been struggled with this issue for hours, trying to figure out why my suffixes was not applied. I think it would be nice to at least make this present behaviour of the suffixes argument appears more clearly in pandas API documentation. Maybe by adding an example, or something like that ? And I agree with the solution of a _forcesuffixes parameter, or something similar. Would be really nice to have, and would release and ambiguity or missunderstanding of the suffixes parameter.