kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

semi_join and anti_join fail when joining for more than one column #13

Closed anmiko closed 7 years ago

anmiko commented 7 years ago

semi_join and anti_join fail when joining on more than one column. You can reproduce it with:

import pandas as pd
from dfply import *

df1 = pd.DataFrame({'x':[1,2,3,4,5], 'y':[10,20,40,50,100]})
df2 = pd.DataFrame({'x':[3,4], 'y':[40,51], 'z':[600,800]})
df1 >> anti_join(df2, by=['x','y'])
# or: df1 >> anti_join(df2, by=[['x','y'],['x','y']])

left_join works fine with the same construction. The error message is:

  df1 >> anti_join(df2, by =['x','y'])
  File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/base.py", line 45, in __rrshift__
    result = self.function(other_copy)
  File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/base.py", line 52, in <lambda>
    return pipe(lambda x: self.function(x, *args, **kwargs))
  File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/join.py", line 246, in anti_join
    other_reduced = other[right_on].drop_duplicates()
  File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/frame.py", line 2053, in __getitem__
    return self._getitem_array(key)
  File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/frame.py", line 2097, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)
  File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py", line 1217, in _convert_to_indexer
    indexer = check = labels.get_indexer(objarr)
  File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py", line 2286, in get_indexer
    indexer = self._engine.get_indexer(target._values)
  File "pandas/index.pyx", line 300, in pandas.index.IndexEngine.get_indexer (pandas/index.c:6420)
  File "pandas/src/hashtable_class_helper.pxi", line 793, in pandas.hashtable.PyObjectHashTable.lookup (pandas/hashtable.c:14637)
TypeError: unhashable type: 'list'

It looks like the problem is in the else block of the code below (taken from the function semi_join):

...
    if not right_on:
        right_on = [col_name for col_name in df.columns.values.tolist() if col_name in other.columns.values.tolist()]
        left_on = right_on
    else:
        right_on = [right_on]
...

Pandas expects a flat list of column names, but when right_on is already a list this block wraps it in another list, producing a list of lists. With the else branch removed, the join works.
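
To illustrate the point above outside of dfply, here is a minimal pandas-only sketch. It reuses df2 from the reproduction and mimics the wrapping done by the else branch; the variable names `wrapped`, `other_reduced`, and `failed` are only for illustration. The exact exception raised for the nested list seems to depend on the pandas version (the traceback above shows a TypeError), so the sketch catches any exception rather than a specific one.

```python
import pandas as pd

df2 = pd.DataFrame({'x': [3, 4], 'y': [40, 51], 'z': [600, 800]})

right_on = ['x', 'y']   # what the caller passes through `by`
wrapped = [right_on]    # what the else branch produces: [['x', 'y']]

# A flat list of column names selects the two key columns as intended.
other_reduced = df2[right_on].drop_duplicates()
print(other_reduced.columns.tolist())

# The wrapped form is a list containing a list, not a list of column names,
# so column selection fails. The exception type varies across pandas
# versions, hence the deliberately broad except clause.
try:
    df2[wrapped]
    failed = False
except Exception as exc:
    failed = True
    print(type(exc).__name__)
```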

kieferk commented 7 years ago

Thanks for catching this. I'll work on a fix tonight – looks like it won't be too bad.

kieferk commented 7 years ago

Sorry for the delay. I changed the else statement so that right_on is wrapped in a list only when it is not already a list. This should work now in the just-pushed version 0.2.4.
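
For readers who want to see the shape of such a check, the following is a rough sketch and not the actual code in dfply 0.2.4. The helper name `as_column_list` is hypothetical. It assumes `by` is either a single column name or a flat sequence of names, and it does not handle the two-list form `by=[['x','y'],['x','y']]` from the report.

```python
def as_column_list(on):
    """Return a flat list of column names.

    Hypothetical helper for illustration only; not part of dfply.
    """
    # A single column name is wrapped so pandas receives a list.
    if isinstance(on, str):
        return [on]
    # A list or tuple of names is copied as-is instead of being wrapped again.
    return list(on)


print(as_column_list('x'))
print(as_column_list(['x', 'y']))
```

Checking for a string first matters because a string is itself iterable, so calling `list('xy')` would split it into characters.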