Open sorenwacker opened 7 years ago
@soerendip : Thanks for reporting this! Could you explain why you would want the suffix to be forcefully added, even if duplicates don't exist? I'm having some difficulty understanding the use-case.
@gfyoung: Say you have several tables with the identical column names. You want to combine them into one table while keeping track of the source. As in my example above, the third table did not get the label. You could use the f.add_suffix('_some_suffix')
, I suppose, and merge the table afterwards.
Hi, I want to work on this, can anyone help me? I am a beginner.
@ash13 I would try a different issue; this request has not gotten much commentary; i am -0 on it nowhere
This isn't coming from nowhere - there's a stackoverflow question on it. As of October 31, 2017, it has 10 up-votes.
@ThomasProctor : Thanks for pointing that out. I'm still not convinced that we should do this. You can easily call df.rename
to rename any confusing columns.
Here's the kludge I've been using to do this. Maybe it will help with a fix. I make no claims to it being remotely well-engineered
def merge_force_suffix(left, right, **kwargs):
on_col = kwargs['on']
suffix_tupple = kwargs['suffixes']
def suffix_col(col, suffix):
if col != on_col:
return str(col) + suffix
else:
return col
left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
del kwargs['suffixes']
return pd.merge(left_suffixed, right_suffixed, **kwargs)
Maybe it will help with a fix.
@ThomasProctor : Thanks for sharing this. I think this would be a reasonable workaround for people to use if they so need to.
@gfyoung My use case right now is merging together 10 or so DataFrames, all with possible redundant column names. Trying to keep track of where columns came from is just annoying.
@gfyoung : +1 for this feature. Like others, I find having to rename dataframes to get suffixes annoying and unnatural. Either you have to modify in place (never the behavior I want), or you have to duplicate the frames. I'd actually argue that the default behavior of join should be to always append suffixes if they are supplied, but at least having the option would make merging multiple frames less unpleasant.
This feature (enforcing suffixes) is definitely needed for all non-trivial merges in order to avoid ambiguity: currently I never know what column names I will get in the result frame. Ideally, it should be a default option: if I provide a suffix then it must be used.
I agree - this very useful.
Having a df 'links' with source and destination columns. I need to add details about the source and the destination from df 'cities'. Easy: juast merge links and cities on the source and city_id and then merge the result again with cities, only this time on destination and city_id. What do I get? A little mess because the details of the source city are NOT marked with the '_s' suffix after the first merge, but the second merge ends with the '_d' suffix on all the columns from df cities.
links = merge(links, cities, left_on='source', right_on='city_id', suffixes=('', '_s'))
links = merge(links, cities, left_on='destination', right_on='city_id', suffixes=('', '_d'))
I don't really want to rename cities prior to the first merge and then again prior to the second. And I definitely do not want cities to have its column names altered after I do the double merge.
The original request seems reasonable. I'm not sure about the name
force_suffixes
, but I haven't come up with anything better.
On Mon, May 27, 2019 at 9:29 PM Juan Pablo Tamayo notifications@github.com wrote:
Having a df 'links' with source and destination columns. I need to add details about the source and the destination from df 'cities'. Easy: juast merge links and cities on the source and city_id and then merge the result again with cities, only this time on destination and city_id. What do I get? A little mess because the details of the source city are NOT marked with the '_s' suffix after the first merge, but the second merge ends with the '_d' suffix on all the columns from df cities.
links = merge(links, cities, left_on='source', right_on='city_id', suffixes=('', '_s')) links = merge(links, cities, left_on='destination', right_on='city_id', suffixes=('', '_d'))
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/17834?email_source=notifications&email_token=AAKAOIWN3KO7GN5FU2K5C5DPXSKH3A5CNFSM4D6OMT6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWKZLNI#issuecomment-496342453, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIS3UKTFJNBIJNKSG73PXSKH3ANCNFSM4D6OMT6A .
add_suffix
is a palatable alternative to building this feature into merge
I think. I agree it doesn't entirely address a use case with merging 10+ dataframes as 10+ calls to add_suffix
isn't great. But it's workable and certainly better than alternatives I used in the past.
yeah this feature would be great . say I have one total dataframe, I break it out by column value and then merge to another. every time i do that in a for loop would be great to add that column value I selected and add it to new columnd name. say product X mean, product Y mean, product Z mean
Hi !
Here is my use case : I'm working on record linkage. I have a reference dataset with fixed fields. (1) The user load its own dataset. (2) I perform record linkage operation. (3) At the end, I need to export the results and select the relevant fields.
As I don't know which fields the user will load, and will be duplicate with my reference dataset, I'd like to add suffixes for the two dataframes while merging. This would fix the problem of correctly selecting the field as I don't know in advance which one will be duplicate.
As reported above, I could suffixe all fields earlier, when loading the dataframes, but seems less convenient.
Any update on this feature? It's been a while since it was first suggested, but as others have pointed out, there are plenty of good use cases for it. Could have certainly used it on my current project with multiple merges!
pandas is all volunteer
you are welcome to submit a PR
Here's the kludge I've been using to do this. Maybe it will help with a fix. I make no claims to it being remotely well-engineered
def merge_force_suffix(left, right, **kwargs): on_col = kwargs['on'] suffix_tupple = kwargs['suffixes'] def suffix_col(col, suffix): if col != on_col: return str(col) + suffix else: return col left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0])) right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1])) del kwargs['suffixes'] return pd.merge(left_suffixed, right_suffixed, **kwargs)
Just a small remark... it's a nice implementation, thanks. However it only works for single column "on". If you have multiple keys, it won't work.
You could make a small adaption:
if col not in on_col:
instead of:
if col != on_col:
Here's the kludge I've been using to do this. Maybe it will help with a fix. I make no claims to it being remotely well-engineered
def merge_force_suffix(left, right, **kwargs): on_col = kwargs['on'] suffix_tupple = kwargs['suffixes'] def suffix_col(col, suffix): if col != on_col: return str(col) + suffix else: return col left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0])) right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1])) del kwargs['suffixes'] return pd.merge(left_suffixed, right_suffixed, **kwargs)
Just a small remark... it's a nice implementation, thanks. However it only works for single column "on". If you have multiple keys, it won't work.
You could make a small adaption:
if col not in on_col:
instead of:if col != on_col:
If you only replace != on_col
with not in on_col
, you'll encounter issues when merging on a single string name column which is inclusive of the name of other columns (e.g. using argument on='ab'
, while another column is named 'b'
). If this happens, you'll end up with the default behavior of merge (e.g. do nothing, or add an _x or _y when columns are duplicates instead of enforcing the dedicated suffixes on those columns for which the name is included in the "on" column), therefore defeating the attempt of this workaround to circumvent the default behavior.
Updated version that works with both single and multi-column "on":
def merge_force_suffix(left, right, **kwargs):
on_col = kwargs['on']
on_col = [on_col] if not isinstance(on_col, (list, tuple)) else on_col
suffix_tupple = kwargs['suffixes']
def suffix_col(col, suffix):
if col not in on_col:
return str(col) + suffix
else:
return col
left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tupple[0]))
right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tupple[1]))
del kwargs['suffixes']
return pd.merge(left_suffixed, right_suffixed, **kwargs)
I don't believe this is still open. need this feature so bad right now.. :(
Hi, I've been struggled with this issue for hours, trying to figure out why my suffixes was not applied. I think it would be nice to at least make this present behaviour of the suffixes argument appears more clearly in pandas API documentation. Maybe by adding an example, or something like that ? And I agree with the solution of a _forcesuffixes parameter, or something similar. Would be really nice to have, and would release and ambiguity or missunderstanding of the suffixes parameter.
Code Sample, a copy-pastable example if possible
Problem description
When using pandas.merge() suffixes are only added to the column-names if a column-name appears in both data frames. With more than two data frames that can lead to the above situation. The first merge adds suffixes. When the resulting table is merged with a third table the column-name does not appear twice and the suffix is not added. The suffix then needs to be added manually. This is overhead, especially when larger numbers of data-frames are merged. An option "force_suffixes" would be appreciated that ensures the suffix is added.
Expected Output