kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

joining on different columns does not work #61

Closed Make42 closed 5 years ago

Make42 commented 5 years ago

I think joining on columns that have different names in the two dataframes does not work. By that I mean

import pandas as pd
from dfply import *

a_df = pd.DataFrame.from_items([('one', [1, 2, 3]), ('two', ['a', 'b', 'c'])])
b_df = pd.DataFrame.from_items([('three', [1, 2, 3]), ('four', ['d', 'e', 'f'])])
a_df >> inner_join(b_df, by=['one', 'three'])

gives the error

  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'one'

and

a_df >> inner_join(b_df,by=[['one'],['three']])

gives

IndexError: list index out of range

sharpe5 commented 5 years ago

Try creating a single new key column which is a combination of the key columns, then join on this new key column.

Does this work?
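
A minimal sketch of this workaround in plain pandas, so it does not depend on how dfply handles `by`. The helper column name `key` is a hypothetical choice for illustration, and the frames are built with the dict constructor on the assumption that `from_items` may not be available in your pandas version:

```python
import pandas as pd

# Same data as in the report, built with the dict constructor.
a_df = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})
b_df = pd.DataFrame({"three": [1, 2, 3], "four": ["d", "e", "f"]})

# Add a shared helper column to each frame ("key" is a made-up name).
a_keyed = a_df.assign(key=a_df["one"])
b_keyed = b_df.assign(key=b_df["three"])

# Inner join on the shared helper column, then drop it again.
joined = a_keyed.merge(b_keyed, on="key", how="inner").drop(columns="key")
print(joined)
```

For a single pair of differently named columns, pandas can also do this directly with `a_df.merge(b_df, left_on="one", right_on="three")`, which avoids the helper column.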


kieferk commented 5 years ago

This was indeed a bug. It should be fixed now: pull down the master branch, check it out, and let me know if you run into any additional issues.

Make42 commented 5 years ago

Thank you! Please push to Anaconda if possible.

steer629 commented 5 years ago

I can confirm that the issue is still the same in 0.3.3.