pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Enhancement: 'Joiner'-Function for pd.merge/DataFrame.join #8962

Open cstotzer opened 9 years ago

cstotzer commented 9 years ago

Would it be possible to add a parameter to pd.merge/DataFrame.join that accepts a function which does the actual merging of the two joined rows?

def my_joiner(left, right):
    # coalesce: fall back to the right-hand value when the left one is missing
    left.a = right.a if pd.isna(left.a) else left.a
    # take the right-hand value when the left row is statistically unreliable
    left.b = right.b if left.cnt < 3 else left.b
    ...
    return left

pd.merge(df1, df2, joiner=my_joiner)

As I often use merge/join to impute missing or statistically unreliable data, I find myself writing code like the following over and over again:

df3 = pd.merge(df1, df2)
df3['a'] = df3.a_x.fillna(df3.a_y)
df3['b'] = df3.apply(lambda x: x.b_x if x.cnt > 3 else x.b_y, axis=1)
...
df3 = df3.drop(['a_x', 'a_y', 'b_x', 'b_y'], axis=1)

This is rather cumbersome and cluttered.

jreback commented 9 years ago

can you post a small but complete copy-pastable example?

cstotzer commented 9 years ago

Sure, here goes

import pandas as pd
import numpy as np

df_left = pd.DataFrame([[1, 3, 4], [2, 3, 5], [np.nan, 2, 8]],
                       index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
df_right = pd.DataFrame([[4, 2], [3, 6]], index=['b', 'c'], columns=['one', 'two'])

# What I do as of now:
# ====================
df_merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# Coalescing two columns (like SQL's NVL/COALESCE functions)
df_merged['one'] = df_merged.one_x.fillna(df_merged.one_y)
# Setting a column's value depending on another (three might be the row count)
df_merged['two'] = df_merged.apply(lambda row: row.two_y if row.three <= 3 else row.two_x, axis=1)
# possibly many more of these
df_final = df_merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)

# What I would like to do:
# ========================
def my_join_func(left, right):
    out = pd.Series(dtype='float64')
    out['one'] = right.one if pd.isna(left.one) else left.one
    out['two'] = right.two if left.three <= 3 else left.two
    return out

df_final = pd.merge(df_left, df_right, join_func=my_join_func, left_index=True, right_index=True, how='left')
# or
df_left.join(df_right, join_func=my_join_func, how='left')

jreback commented 9 years ago

this looks like https://github.com/pydata/pandas/issues/7480

yes? (though I like your example a bit better).

cstotzer commented 9 years ago

Not really, I'm looking for an easier way to specify which column to take from which dataset in the join operation, like in the SELECT clause of an SQL join.

SELECT COALESCE(l.one, r.one) AS one,
       CASE WHEN l.three <= 3 THEN r.two ELSE l.two END AS two,
       three
  FROM df_left AS l
  LEFT JOIN df_right AS r ON l.index = r.index

jreback commented 9 years ago

having a joiner function is not going to be efficient at all. Can you not prefilter?

cstotzer commented 9 years ago

You figure that

pd.merge(left, right, ...).apply(func) 

is more efficient?

jreback commented 9 years ago

No, .apply is not efficient. I mean that using a Python function in the middle of a vectorized operation is not efficient.

You can use .where, which will do an if-then-else:

result = result.where(......)

or you can do pd.merge(df_left.where(.....)....) (e.g. pre-filter the unwanted values to NaN).
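
For what it's worth, here is a minimal sketch of that vectorized approach for the earlier example, reusing the df_left/df_right frames (and imports) from above. This uses today's API, not the proposed joiner:

# left join on the index, as before
df_merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# COALESCE(l.one, r.one): fill missing left values with the right-hand values
df_merged['one'] = df_merged['one_x'].fillna(df_merged['one_y'])
# CASE WHEN l.three <= 3 THEN r.two ELSE l.two END: vectorized if-then-else via .where
df_merged['two'] = df_merged['two_x'].where(df_merged['three'] > 3, df_merged['two_y'])
df_final = df_merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)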

bscheetz commented 6 years ago

@jreback @cstotzer should this be closed? Given that it's a goal of pandas to minimize complexity in function calls, and this functionality exists (in vectorized form) via the DataFrame/Series .where functions, it seems like this is settled.

banderlog commented 4 years ago

Please consider the following example:

We have two dataframes, something like this:

# DF1 (each SOME_ID has a few rows, sorted by date)
          SOME_ID        DATE   F0             F1              F2
1516     775335.0   2019-09-16  625.0          10.0            2.5   
1537     775335.0   2019-10-07  400.0          10.1            2.5   
1565     775335.0   2019-11-04  521.0           9.8            2.5   
1593     775335.0   2019-12-02  264.0          10.2            2.5   
1628     775335.0   2020-01-06  566.0           9.6            5.0
...
9999     789969.0   2020-02-16  666.0           9.9            5.5
# DF2 (each SOME_ID has a single row)
         SOME_ID  THRESHOLD_DATE
0       775335.0  2020-02-03
1       782756.0  2019-06-03
2       787437.0  2019-12-02
3       789498.0  2019-10-30
4       789969.0  2019-01-25
..           ...         ...

I want to do something like this:

select *
from DF1 a
inner join DF2 b
  on a.some_id = b.some_id
where b.threshold_date > a.date

In pandas I wrote this, because there is no conditional join:

foo = []
# loop over groups, look up each group's threshold date, then filter by it
for some_id, df_tmp in df1.groupby('SOME_ID'):
    th_date = df2.query('SOME_ID == @some_id')['THRESHOLD_DATE'].iloc[0]
    tmp = df_tmp.query('DATE >= @th_date')
    foo.append(tmp)
final_df = pd.concat(foo)

And it is very slow, around 5-7 seconds for a 15K-row dataset, because of the loop.

How can I do this in a pythonic/pandas way without a conditional join?

banderlog commented 4 years ago

This version takes ~2 sec on the same dataset:

(df1.groupby('SOME_ID').apply(lambda x: x[x['DATE'] >= df2.query('SOME_ID == @x.name')['THRESHOLD_DATE'].iloc[0]])).reset_index(drop=True)

But it is definitely not more pythonic or more readable.
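
For reference, a minimal vectorized sketch (assuming the DF1/DF2 column names shown above): merge on the key first, then filter with a boolean mask, which avoids the per-group loop entirely.

# inner join on SOME_ID, then apply the date condition as a vectorized filter
merged = df1.merge(df2, on='SOME_ID', how='inner')
final_df = (merged[merged['DATE'] >= merged['THRESHOLD_DATE']]
            .drop(columns='THRESHOLD_DATE')
            .reset_index(drop=True))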

mroeschke commented 3 years ago

xref https://github.com/pandas-dev/pandas/issues/7480 and https://github.com/pandas-dev/pandas/issues/10309 for specific cases of merging based on flexible conditions (that can be rewritten as a joiner function)

qbit-git commented 3 years ago

PySpark's join has this feature: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.join.html
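
A rough sketch of what that looks like in PySpark, reusing the DF1/DF2 example above (the SparkSession setup and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)

# the join condition can be an arbitrary boolean column expression
joined = sdf1.join(
    sdf2,
    on=(sdf1.SOME_ID == sdf2.SOME_ID) & (sdf1.DATE >= sdf2.THRESHOLD_DATE),
    how='inner',
)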

aa1371 commented 3 years ago

It seems there are many issues related to conditional joining, and this appears to be the main one tracking it, so I'll follow up here.

I recently opened (and closed, as a duplicate of this and other issues) an issue with a proposed working pseudo-solution for this. The main idea is that you perform a lazy cross join and then filter based on a provided function in a memory-efficient way. In my example, the memory-efficient filtering was done by breaking the two frames into chunks and cross-joining/filtering the Cartesian products of the chunks piece by piece, so that memory never blows up.

However, as stated in that issue, I don't think chunking is the ultimate solution; that was just for quick demo purposes. It could instead use optimized Cython code that runs through the Cartesian product of rows of the two frames and applies the provided filter pair by pair (or at some other optimal interval).

In the proposed solution the api looks like this:

pd.lazy_cross_join(df, dates).loc[lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]

Here lazy_cross_join returns a LazyMerge result, and the actual joining and filtering is deferred until the .loc accessor is called on it. However, this idea would still work with the API originally proposed here:

pd.merge(df1, df2, condition=lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date), how='cross')

(where the dfx passed into the condition lambda is the "deferred/lazy" cross-join result)

Issue/proposed pseudo-solution: https://github.com/pandas-dev/pandas/issues/41150
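
To make the chunking idea concrete, here is a minimal sketch of a helper along those lines (a hypothetical function, not a pandas API or the proposed API; assumes pandas >= 1.2 for how='cross'):

import pandas as pd

def chunked_cross_join_filter(left, right, condition, chunk_size=10_000):
    """Cross join `left` and `right` chunk by chunk, keeping only rows where
    `condition(joined)` is True, so the full Cartesian product is never
    materialized in memory at once."""
    pieces = []
    for start in range(0, len(left), chunk_size):
        chunk = left.iloc[start:start + chunk_size]
        joined = chunk.merge(right, how='cross')
        pieces.append(joined[condition(joined)])
    return pd.concat(pieces, ignore_index=True)

# usage mirroring the proposed API above:
# result = chunked_cross_join_filter(
#     df, dates,
#     lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date))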

trianta2 commented 3 years ago

A conditional join, similar to PySpark's, would be very useful. The PySpark join expression API is simple and clean.

samukweku commented 2 years ago

@aa1371 @trianta2 pyjanitor has a conditional_join function that covers non-equi joins. There is also a pending PR with a significant performance improvement, implemented in Numba. Some benchmark results are listed in the PR.