Open gwerbin opened 4 months ago
@gwerbin Can you share a reproducible example?
Seems to me this is P.join(Q, how="right")
, but as @samukweku said - please provide a reproducible example with the input and desired output.
It's not exactly the same as the right join because I'm not interested in any of the columns from Q
, only subsetting indices. But I realize now that you could write it like this, which is a lot tidier than what I was doing before:
P.join(Q.loc[:, []], how="right")
I also had different behavior in mind for if there were duplicates in Q.index
than what right join offers, but that wasn't part of my original example.
Maybe this is a good enough recipe to warrant not adding new functionality to the Pandas interface, but it's not obvious and IMO would look like opaque magic to a typical non-expert user.
P = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 8, "u", -1.60),
(2, 8, "v", -1.65),
(1, 8, "u", -1.50),
(1, 8, "v", -1.55),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
Q = pd.DataFrame(
data=[
(1, 9, 2.5),
(2, 7, 3.5),
],
columns=["a", "b", "y"],
).set_index(["a", "b"])
expected = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
pd.testing.assert_frame_equal(
P.join(Q.loc[:, []], how="right"),
expected,
)
I didn't want to complicate things by including this example, but this is where what I want diverges from a typical right join.
Inputs:
P = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 8, "u", -1.60),
(2, 8, "v", -1.65),
(1, 8, "u", -1.50),
(1, 8, "v", -1.55),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
Q = pd.DataFrame(
data=[
(1, 9, 2.5),
(2, 7, 3.5),
(2, 7, 4.5),
],
columns=["a", "b", "y"],
).set_index(["a", "b"])
expected = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
pd.testing.assert_frame_equal(
P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right"),
expected,
)
Note that Polars supports how="semi"
join in the polars.DataFrame.join
method: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.join.html
They also support the similarly-useful how="anti"
join, which was also not part of the original intended scope, but would IMO be useful in Pandas for similar reasons.
But I realize now that you could write it like this, which is a lot tidier than what I was doing before:
P.join(Q.loc[:, []], how="right")
It can be slightly simpler with P.join(Q[[]], how="right")
.
Maybe this is a good enough recipe to warrant not adding new functionality to the Pandas interface, but it's not obvious and IMO would look like opaque magic to a typical non-expert user.
Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.
P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right")
Sorry - I missed this but my position is the same. pandas provides all the tools for you to accomplish the computation in a reasonable line of code.
Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.
Are "filtering joins" not a common use case? Semis along with the matching anti-joins? Anti-join is the one that bothers me more though as it requires more code to replicate (and I use it very often).
Personally, I always find it somewhat odd that the filtering join types are absent in Pandas. A simple and explicit syntax for the filtering joins seems like it would be beneficial here.
edit: and it is a further complication, but as @gwerbin mentioned, usually semi/anti joins are "safe" from multiple-merge issues--since the function knows you are only filtering your data it won't duplicate rows in the left where there are multiple matching results on the right. In my opinion, this is one of the big values to using them as it saves effort and makes your code more robust.
Feature Type
[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
I think there are some related issues on this topic, but none that clearly describes the particular use case I have in mind.
Consider data frame
P
with multi-index columnsa, b, c
and data frameQ
with multi-index columnsa, b
.Given some subset of rows from
Q
, I want to be able to use its.index
to subset rows fromP
, without a lot of intermediate processing or boilerplate code.In relational terms, this is a semi-join operation between P and the subset of rows from Q.
Feature Description
Setup:
Desired functionality
Any of these, or something similarly convenient:
Alternative Solutions
Something like this, I think?
In the case when
Q_sub
is exactly 1 row, we can also use.xs
:Additional Context
Maybe related to https://github.com/pandas-dev/pandas/issues/4036 and https://github.com/pandas-dev/pandas/issues/3057