pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

Open gwerbin opened 4 months ago

gwerbin commented 4 months ago

Problem Description

I think there are some related issues on this topic, but none that clearly describes the particular use case I have in mind.

Consider a data frame P whose index is a MultiIndex with levels a, b, c, and a data frame Q whose index is a MultiIndex with levels a, b.

Given some subset of rows from Q, I want to be able to use its .index to subset rows from P, without a lot of intermediate processing or boilerplate code.

In relational terms, this is a semi-join operation between P and the subset of rows from Q.

Feature Description

Setup:

P = pd.DataFrame(..., index=pd.MultiIndex(..., names=["a", "b", "c"]))
Q = pd.DataFrame(..., index=pd.MultiIndex(..., names=["a", "b"]))

assert not P.index.duplicated().any()
assert not Q.index.duplicated().any()

Q_sub = Q.head()

Desired functionality

Any of these, or something similarly convenient:

P_sub = P.loc[Q_sub.index]

P_sub = P.xs(Q_sub.index)

P_sub = P.xs(Q_sub.index, level=Q_sub.index.names)

P_sub = P.join_semi(Q_sub, how="semi")

Alternative Solutions

Something like this, I think?

P_sub = P.loc[pd.IndexSlice[Q_sub.index.get_level_values("a"), Q_sub.index.get_level_values("b"), :]]

In the case when Q_sub is exactly 1 row, we can also use .xs:

P_sub = P.xs(Q_sub.index[0], level=Q_sub.index.names)
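
For completeness, a membership test on the index also works; a minimal sketch of a droplevel/isin workaround (just one possible approach, using P and Q_sub from the setup above). Note that the IndexSlice expression above matches the "a" and "b" level values independently (effectively their Cartesian product), so it can select more rows than the exact (a, b) pairs in Q_sub.index.

# Drop the extra "c" level so the remaining (a, b) pairs can be tested for membership
# in Q_sub.index; boolean masking never duplicates or reorders rows of P.
P_sub = P.loc[P.index.droplevel("c").isin(Q_sub.index)]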

Additional Context

Maybe related to https://github.com/pandas-dev/pandas/issues/4036 and https://github.com/pandas-dev/pandas/issues/3057

samukweku commented 4 months ago

@gwerbin Can you share a reproducible example?

rhshadrach commented 4 months ago

Seems to me this is P.join(Q, how="right"), but as @samukweku said - please provide a reproducible example with the input and desired output.

gwerbin commented 4 months ago

It's not exactly the same as a right join, because I'm not interested in any of the columns from Q, only in using its index to subset rows. But I realize now that you could write it like this, which is a lot tidier than what I was doing before:

P.join(Q.loc[:, []], how="right")

I also had different behavior in mind than what a right join offers for the case where there are duplicates in Q.index, but that wasn't part of my original example.

Maybe this recipe is good enough to justify not adding new functionality to the pandas interface, but it's not obvious, and IMO it would look like opaque magic to a typical non-expert user.

The non-duplicate case

P = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 8, "u", -1.60),
        (2, 8, "v", -1.65),
        (1, 8, "u", -1.50),
        (1, 8, "v", -1.55),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

Q = pd.DataFrame(
    data=[
        (1, 9, 2.5),
        (2, 7, 3.5),
    ],
    columns=["a", "b", "y"],
).set_index(["a", "b"])

expected = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

pd.testing.assert_frame_equal(
    P.join(Q.loc[:, []], how="right"),
    expected,
)

The duplicate case

I didn't want to complicate things by including this example, but this is where what I want diverges from a typical right join.

Inputs:

P = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 8, "u", -1.60),
        (2, 8, "v", -1.65),
        (1, 8, "u", -1.50),
        (1, 8, "v", -1.55),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

Q = pd.DataFrame(
    data=[
        (1, 9, 2.5),
        (2, 7, 3.5),
        (2, 7, 4.5),
    ],
    columns=["a", "b", "y"],
).set_index(["a", "b"])

expected = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

pd.testing.assert_frame_equal(
    P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right"),
    expected,
)

Prior art

Note that Polars supports how="semi" join in the polars.DataFrame.join method: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.join.html

They also support the similarly-useful how="anti" join, which was also not part of the original intended scope, but would IMO be useful in Pandas for similar reasons.
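
For illustration, a minimal Polars sketch of both join types (ad-hoc example data, not the pandas frames above):

import polars as pl

left = pl.DataFrame({"a": [1, 1, 2], "b": [9, 9, 7], "x": [-1.70, -1.75, -1.40]})
right = pl.DataFrame({"a": [1, 2], "b": [9, 8], "y": [2.5, 3.5]})

# Semi: keep rows of `left` whose (a, b) pair appears in `right`; no columns from `right`
# are added, and multiple matches on the right do not duplicate rows on the left.
semi = left.join(right, on=["a", "b"], how="semi")

# Anti: keep rows of `left` whose (a, b) pair does not appear in `right`.
anti = left.join(right, on=["a", "b"], how="anti")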

rhshadrach commented 4 months ago

But I realize now that you could write it like this, which is a lot tidier than what I was doing before:

P.join(Q.loc[:, []], how="right")

It can be slightly simpler with P.join(Q[[]], how="right").

Maybe this recipe is good enough to justify not adding new functionality to the pandas interface, but it's not obvious, and IMO it would look like opaque magic to a typical non-expert user.

Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.

rhshadrach commented 4 months ago

P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right")

Sorry - I missed this but my position is the same. pandas provides all the tools for you to accomplish the computation in a reasonable line of code.

ottothecow commented 4 days ago

Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.

Are "filtering joins" not a common use case? Semis along with the matching anti-joins? Anti-join is the one that bothers me more though as it requires more code to replicate (and I use it very often).

Personally, I have always found it somewhat odd that the filtering join types are absent from pandas. A simple and explicit syntax for them seems like it would be beneficial here.

edit: It's a further complication, but as @gwerbin mentioned, semi/anti joins are usually "safe" from multiple-match issues: since the function knows you are only filtering your data, it won't duplicate rows on the left when there are multiple matching rows on the right. In my opinion, this is one of the big advantages of using them, since it saves effort and makes your code more robust.
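
For reference, a sketch of how one might replicate both filtering joins on the index today, using the P and Q frames from the examples above (the droplevel/isin combination is just one possible workaround, not an official recipe):

# The (a, b) pairs of P's index, with the extra "c" level removed.
pairs = P.index.droplevel("c")

# Semi-join on index: keep rows of P whose (a, b) pair appears in Q.index.
P_semi = P.loc[pairs.isin(Q.index)]

# Anti-join on index: keep rows of P whose (a, b) pair does not appear in Q.index.
# Duplicates in Q.index don't matter here, since this is a membership test rather than a merge.
P_anti = P.loc[~pairs.isin(Q.index)]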