kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0
889 stars 103 forks source link

More general joining #68

Closed jankislinger closed 5 years ago

jankislinger commented 5 years ago

I have fixed a typo (right_in -> right_on) and reversed logic in one if statement in function that creates join parameters. See the changes here: https://github.com/kieferk/dfply/compare/master...jankislinger:fix-join-multiple-by

Now it can be used to join tables on columns with different names:

import pandas as pd
from dfply import *

a = pd.DataFrame({
    'x1': ['A', 'B', 'C'],
    'x2': [1, 2, 3]
})
b = pd.DataFrame({
    'x4': ['A', 'B', 'D'],
    'x3': [True, False, True]
})

a >> inner_join(b, by=('x1', 'x4'))

It would be also convenient to be able to use multiple by statements. For example expression

a >> inner_join(b, by=['x1', ('x2', 'x3')])

could be used as

a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x3'])

If you agree I would modify the code and create a PR.

sharpe5 commented 5 years ago

Nice work! I remember I had trouble getting joins to work on multiple columns. I ended up computing a "key" column which was a string concatenation of three other columns, then joining on that. Perhaps this PR will make this workaround unnecessary.

On Tue, 4 Sep 2018 07:47 Jan Kislinger, notifications@github.com wrote:

I have fixed a typo (right_in -> right_on) and reversed logic in one if statement in function that creates join parameters. See the changes here: master...jankislinger:fix-join-multiple-by https://github.com/kieferk/dfply/compare/master...jankislinger:fix-join-multiple-by

Now it can be used to join tables on columns with different names:

import pandas as pd from dfply import *

a = pd.DataFrame({ 'x1': ['A', 'B', 'C'], 'x2': [1, 2, 3] }) b = pd.DataFrame({ 'x4': ['A', 'B', 'D'], 'x3': [True, False, True] })

a >> inner_join(b, by=('x1', 'x4'))

It would be also convenient to be able to use multiple by statements. For example expression

a >> inner_join(b, by=['x1', ('x2', 'x3')])

could be used as

a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x3'])

If you agree I would modify the code and create a PR.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kieferk/dfply/issues/68, or mute the thread https://github.com/notifications/unsubscribe-auth/ABOypD5n48_Si53piWQPApCUMiqCARs5ks5uXiHfgaJpZM4WYYqg .

jankislinger commented 5 years ago

The only issue I can see is if you use by as list of length 2, e.g. by=['x1', 'x2']. It is not clear whether you want to use x1 from left table and x2 from the right one or both columns from both tables. I would suggest using list for multiple columns and tuple for different names of the same column. The third option is to use dictionary (which is btw the closest to the implementation in dplyr).

List (two columns to join by, same names in bot data frames):

a >> inner_join(b, by=['x1', 'x2'])
a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x2'])

Tuple (single column, different names):

a >> inner_join(b, by=('x1', 'x2'))
a.merge(b, left_on='x1', right_on='x2')
sharpe5 commented 5 years ago

By convention, by=['x1', 'x2'] would use both columns from both tables.

If you want to use column x1 from table 1 and x2 from table 2, then create a common column by setting x1 = x2 in table 2.

On Tue, 4 Sep 2018 08:25 Jan Kislinger, notifications@github.com wrote:

The only issue I can see is if you use by as list of length 2, e.g. by=['x1', 'x2']. It is not clear whether you want to use x1 from left table and x2 from the right one or both columns from both tables. I would suggest using list for multiple columns and tuple for different names of the same column. The third option is to use dictionary (which is btw the closest to the implementation in dplyr).

List (two columns to join by, same names in bot data frames):

a >> inner_join(b, by=['x1', 'x2']) a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x2'])

Tuple (single column, different names):

a >> inner_join(b, by=('x1', 'x2')) a.merge(b, left_on='x1', right_on='x2')

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kieferk/dfply/issues/68#issuecomment-418267606, or mute the thread https://github.com/notifications/unsubscribe-auth/ABOypJKRSeCk2cH8zRj9UqEtFs4SPNKpks5uXirCgaJpZM4WYYqg .

jankislinger commented 5 years ago

I agree that by=['x1', 'x2'] should use both columns in both tables. But having to rename column before join is annoying. That's why I would use tuples for that case.

See this commit: https://github.com/jankislinger/dfply/commit/2d892186eeda4f837e0f46a63ef45434b5c5b502

kieferk commented 5 years ago

Cool thank you. Will merge the PR now.