Open cstotzer opened 9 years ago
Can you post a small but complete copy-pastable example?
Sure, here goes
import pandas as pd
import numpy as np

df_left = pd.DataFrame([[1, 3, 4], [2, 3, 5], [np.nan, 2, 8]],
                       index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
df_right = pd.DataFrame([[4, 2], [3, 6]], index=['b', 'c'], columns=['one', 'two'])

# What I do as of now:
# ================
df_merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# Coalescing two columns (like SQL's NVL/COALESCE functions)
df_merged['one'] = df_merged.one_x.fillna(df_merged.one_y)
# Setting a column's value depending on another (three might be the row count)
df_merged['two'] = df_merged.apply(lambda row: row.two_y if row.three <= 3 else row.two_x, axis=1)
# possibly many more of these
df_final = df_merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)
# What I would like to do:
# =================
def my_join_func(left, right):
    out = pd.Series()
    out['one'] = right.one if pd.isna(left.one) else left.one
    out['two'] = right.two if left.three <= 3 else left.two
    return out

df_final = pd.merge(df_left, df_right, join_func=my_join_func, left_index=True, right_index=True, how='left')
# or
df_left.join(df_right, join_func=my_join_func, how='left')
this looks like https://github.com/pydata/pandas/issues/7480
yes? (though I like your example a bit better).
Not really, I'm looking for an easier way to specify which column to take from which dataset in the join operation, like in the SELECT clause of a SQL join:
SELECT COALESCE(l.one, r.one) AS one,
       CASE WHEN r.three <= 3 THEN r.two ELSE l.two END AS two,
       three
FROM df_left AS l
LEFT JOIN df_right AS r ON l.index = r.index
Having a joiner function is not going to be efficient at all. Can you not prefilter?
You figure that
pd.merge(left, right, ...).apply(func)
is more efficient?
No, .apply is not efficient. I mean that using a Python function in the middle of a vectorized operation is not efficient.
I mean do the if-then-else with a vectorized operation:
result = result.where(......)
or you can do pd.merge(df_left.where(.....)....)
(e.g. pre-filter the unwanted values to NaN)
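For the first example in this thread, both the coalesce and the if-then-else can be done with vectorized fillna/where instead of .apply — a minimal sketch of that pre-filtering approach, reusing the df_left/df_right frames from above:

```python
import numpy as np
import pandas as pd

df_left = pd.DataFrame([[1, 3, 4], [2, 3, 5], [np.nan, 2, 8]],
                       index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
df_right = pd.DataFrame([[4, 2], [3, 6]], index=['b', 'c'], columns=['one', 'two'])

merged = pd.merge(df_left, df_right, left_index=True, right_index=True, how='left')
# COALESCE(l.one, r.one): take the right value only where the left is missing
merged['one'] = merged['one_x'].fillna(merged['one_y'])
# CASE WHEN three <= 3 THEN two_y ELSE two_x END, without .apply:
# .where keeps the left value where the condition holds and substitutes the right elsewhere
merged['two'] = merged['two_x'].where(merged['three'] > 3, merged['two_y'])
df_final = merged.drop(['one_x', 'one_y', 'two_x', 'two_y'], axis=1)
```

This stays entirely in vectorized operations, which is what the comment above is getting at, though it still requires one line per output column rather than a single joiner function.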
@jreback @cstotzer should this be closed? Given that it's a goal of pandas to minimize complexity in function calls and this functionality exists (in a vectorized form) via DF/Series where functions, it seems like this is set.
Please consider the following example:
We have two dataframes, something like this:
# DF1 (each SOME_ID has a few rows, sorted by date)
SOME_ID DATE F0 F1 F2
1516 775335.0 2019-09-16 625.0 10.0 2.5
1537 775335.0 2019-10-07 400.0 10.1 2.5
1565 775335.0 2019-11-04 521.0 9.8 2.5
1593 775335.0 2019-12-02 264.0 10.2 2.5
1628 775335.0 2020-01-06 566.0 9.6 5.0
...
9999 789969.0 2020-02-16 666.0 9.9 5.5
# DF2 (each SOME_ID has a single row)
SOME_ID THRESHOLD_DATE
0 775335.0 2020-02-03
1 782756.0 2019-06-03
2 787437.0 2019-12-02
3 789498.0 2019-10-30
4 789969.0 2019-01-25
.. ... ...
I want to do something like this:
SELECT *
FROM DF1 a
INNER JOIN DF2 b
  ON a.some_id = b.some_id
WHERE b.threshold_date > a.date
In pandas I made this, because there is no conditional join:
foo = []
for id, df_tmp in df1.groupby('SOME_ID'):
    th_date = df2.query('SOME_ID == @id')['THRESHOLD_DATE'].iloc[0]
    tmp = df_tmp.query('DATE >= @th_date')
    foo.append(tmp)
final_df = pd.concat(foo)
And it is very slow, around 5-7 seconds for a 15K-row dataset, because of the Python loop. How do I do this in a pythonic/pandas way without a conditional join?
This version runs in ~2 seconds for the same dataset:
(df1.groupby('SOME_ID').apply(lambda x: x[x['DATE'] >= df2.query('SOME_ID == @x.name')['THRESHOLD_DATE'].iloc[0]])).reset_index(drop=True)
But it is definitely not more pythonic or more readable.
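One vectorized alternative is to do the plain inner merge on SOME_ID first and apply the condition as a boolean mask afterwards. A sketch with made-up sample data (note it implements the SQL's `threshold_date > date`; the loop snippet above filters the opposite direction with `DATE >= threshold`, so flip the comparison if that was the intent):

```python
import pandas as pd

# Hypothetical sample data shaped like DF1/DF2 from the example
df1 = pd.DataFrame({
    'SOME_ID': [775335.0, 775335.0, 789969.0],
    'DATE': pd.to_datetime(['2019-09-16', '2020-03-01', '2020-02-16']),
})
df2 = pd.DataFrame({
    'SOME_ID': [775335.0, 789969.0],
    'THRESHOLD_DATE': pd.to_datetime(['2020-02-03', '2019-01-25']),
})

# INNER JOIN on SOME_ID, then the WHERE clause as a vectorized mask
merged = df1.merge(df2, on='SOME_ID', how='inner')
final_df = merged.loc[merged['THRESHOLD_DATE'] > merged['DATE']]
```

Since each SOME_ID has a single row in DF2 this intermediate is the same size as DF1, so memory is not a concern here, and the whole thing stays vectorized with no per-group Python code.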
xref https://github.com/pandas-dev/pandas/issues/7480 and https://github.com/pandas-dev/pandas/issues/10309 for specific cases of merging based on flexible conditions (that can be rewritten as a joiner function)
pyspark's join has this feature: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.join.html
There seem to be many issues related to conditional joining, and this appears to be the main one tracking it, so I'll follow up here.
I recently opened (and closed due to being a duplicate issue of this and others) an issue with a proposed working pseudo-solution for this. The main idea is that you perform a lazy cross-join and then filter based on a provided function in a memory efficient way. In my example the "filter in a memory efficient way" was performed by breaking the 2 frames into chunks and cross joining/filtering the cartesian products of the chunks piece by piece so that you never blow up your memory.
However, as stated in that issue, I don't think chunking is the ultimate solution, that was just for quick demo purposes. I'm thinking it could instead use some optimized cython code, that runs through the cartesian product of rows of the 2 frames and performs the provided filter pair by pair (or at some other optimal interval).
In the proposed solution the api looks like this:
pd.lazy_cross_join(df, dates).loc[lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]
Where lazy_cross_join returns a LazyMerge result, and the actual joining and filtering is deferred until the .loc accessor is called on it. However, this idea would still work with the API originally proposed here:
pd.merge(df1, df2, condition=lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date), how='cross')
(where the dfx passed into the condition lambda is the "deferred/lazy" cross-join result)
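Neither lazy_cross_join nor a condition= parameter exists in pandas today; the eager equivalent of this proposal can be written with the how='cross' merge pandas does ship (1.2+). A sketch with hypothetical df/dates frames matching the column names (ts, start_date, end_date) used above:

```python
import pandas as pd

# Hypothetical frames matching the proposal's column names
df = pd.DataFrame({'ts': pd.to_datetime(['2021-01-05', '2021-02-10'])})
dates = pd.DataFrame({'start_date': pd.to_datetime(['2021-01-01']),
                      'end_date': pd.to_datetime(['2021-02-01'])})

# Eager cartesian product, then the condition as a boolean mask.
# This materializes the full len(df) * len(dates) product in memory —
# exactly what the lazy/chunked proposal is trying to avoid.
dfx = pd.merge(df, dates, how='cross')
result = dfx[(dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]
```

The proposal's value is entirely in deferring that intermediate product; the filtering logic itself is the same boolean indexing shown here.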
Issue/proposed pseudo-solution below: https://github.com/pandas-dev/pandas/issues/41150
A conditional join, similar to pyspark, would be very useful. The pyspark join expression API is simple and clean.
@aa1371 @trianta2 pyjanitor has a conditional_join function that covers non-equi joins. There is also a pending PR with a significant performance improvement, implemented in numba. Some benchmark results are listed in the PR.
Would it be possible to add a parameter to pd.merge/DataFrame.join which accepts a function that does the actual merging of the two joined rows?
As I often use merge/join to impute missing or statistically unreliable data, I find myself writing code like the following over and over again:
Which is rather cumbersome and cluttered.