ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.2k stars 590 forks source link

feat: Support selectors as join predicates #8026

Open coroa opened 9 months ago

coroa commented 9 months ago

Is your feature request related to a problem?

No

Describe the solution you'd like

I'd like to be able to use selectors to specify on which columns to join.

f.ex. for two tables df1 (with columns model, scenario, value) and df2 (with columns model, year, value), I'd like:

df1.inner_join(df2, ~s.c("value"))

to be interpreted the same as

df1.inner_join(df2, ["model"])

ie. the selector should be interpreted as the intersection of the selected columns of both tables.

Refer also to discussion on zulip https://ibis-project.zulipchat.com/#narrow/stream/405265-tech-support/topic/.E2.9C.94.20selectors.20on.20joins .

What version of ibis are you running?

7.0.1

What backend(s) are you using, if any?

No response

Code of Conduct

cpcloud commented 9 months ago

Thanks for the issue!

I don't think we can imbue specific selectors with special meaning in certain contexts.

There are a number of downsides:

What about a new selector like s.overlaps(left, right) that computes the column intersection? I realize it's not as convenient, but it fits in to the existing paradigm without the downsides of the specialized interpretation, the main downside being verbosity I guess.

coroa commented 9 months ago

s.overlaps would not solve the example use case, would it?

Since i explicitly wanted to transport the meaning of ~s.c("value"), ie. shared columns except "value".

The only thing that is not well defined at the moment is how do selectors behave in regard to multiple tables, ie. how to define (maybe keep it at 2, rather than n tables at present) selector.expand(table1, table2).

For the sake of ease you could just define the join-mode as intersect generally, or you add a join mode to a selector (as field, defaulting to intersect) with a .join_as("union") overwrite method, which also takes ("left", "right", "union" or "intersect").

cpcloud commented 9 months ago

s.overlaps would not solve the example use case, would it?

Sure, alone it would not. I am suggesting that that is a composable piece that when combined with ~s.value(...) gets you what you want.

The only thing that is not well defined at the moment is how do selectors behave in regard to multiple tables

The devil is in the details here, but I'm not sure the work is justified without some more use cases. To be clear, I think your use case is important, I'm just not sure whether generalizing all selectors to work on 2+ tables is the right approach. It feels a bit like the nuclear option.

For the sake of ease you could just define the join-mode as intersect generally,

Yes, this is possible, but I'm not sure the tradeoff is worth it. This still has the downside of the same selector behaving differently in different contexts, as opposed to having a specific selector that gets you closer to what you want (a shorthand for specifying certain join key patterns) and composes with the rest of the selectors.

cpcloud commented 9 months ago

or you add a join mode to a selector (as field, defaulting to intersect) with a .join_as("union") overwrite method, which also takes ("left", "right", "union" or "intersect").

This is an interesting idea that might be worth exploring!

cpcloud commented 9 months ago

I'll do a bit of digging around and see how dplyr handles selectors (if it does!) in joins.

coroa commented 9 months ago

s.overlaps would not solve the example use case, would it?

Sure, alone it would not. I am suggesting that that is a composable piece that when combined with ~s.value(...) gets you what you want.

It's quite likely i did not get your example then.

What does s.overlaps(left, right) mean? ie what is left what is right?

I reviewed the selectors module now.

class Selector(Concrete):
    @abc.abstractmethod
    def expand(self, table: ir.Table) -> Sequence[ir.Value]:
        ...

What about:

class JoinSelector(Concrete):
    def expand(self, left: ir.Table, right: ir.Table) -> Sequence[str]:
        def names(tab: ir.Table):
            return [c.get_name() for c in self.selector.expand(tab)]

        mode = getattr(self, "mode", "intersect")

        if mode == "left":
            return names(left)
        if mode == "right":
            return names(right)
        if mode == "intersect":
            names_right = frozenset(names(right))
            return [n for n in names(left) if n in names_right]
        if mode == "union":
            names_left = names(left)
            return names_left + [n for n in names(right) if n not in names_left]

        raise ValueError('mode must be one of {"intersect", "union", "left", "right"')

def join(selector: Selector, mode="intersect"):
    return JoinSelector(selector=selector, mode=mode)

To allow

df1.inner_join(df2, s.join(~s.c("value")))

is that similar to what you had in mind with s.overlaps.

The tricky bits, which i don't like yet, are: This is not a Selector, since it does not follow the selector protocol, ie. the expand signature is:

expand(self, left: ir.Table, right: ir.Table) -> Sequence[str]

rather than

expand(self, table: ir.Table) -> Sequence[ir.Value]

But then this is somewhat necessary.

Maybe we would want to stick with an ir.Value based return-type, but then this would probably have to be one of Sequence[Tuple[ir.Value, ir.Value]] (or Tuple[Sequence[ir.Value], Sequence[ir.Value]]) to accomodate the different tables.

Are there other operations than join that interact with more than one table?

cpcloud commented 9 months ago

🤔 Very interesting.

Are there other operations than join that interact with more than one table?

The various set operations: t.union(s)/t.intersect(s)/t.difference(s). I'm not sure if there are specialized selectors that are necessary there though, for the operation to succeed at all the schemas of the inputs must match. Let's consider just join for the moment to keep the discussion focused.

cpcloud commented 9 months ago

Maybe we would want to stick with an ir.Value based return-type,

I think we should, if possible.

I don't think we need to worry about preserving the origin table in the structure of the return type, as that information is already available in the expression graph.

coroa commented 9 months ago

Ok, I quickly put up two similar draft implementations:

  1. 8027 implements the mentioned join "selector" with the different modes

  2. 8028 adds just a simple function expand_overlap to the selector module to expand the intersection

Both implementations end up with ir.Value tuples and use those in the _clean_join_predicates function to generate a working join predicate.

I like #8028 due to its simplicity, but i also do not see a use-case for the non-intersection modes.

coroa commented 9 months ago

I had a quick look at compatibility with the-epic-split on Saturday and i think the equivalent of hooking expand_overlap (or JoinSelector.expand) into _clean_join_predicates there is to modify https://github.com/ibis-project/ibis/blob/e851796fee255b78afc57c69748422d75dae6af1/ibis/expr/types/joins.py#L97-L130 and include a case as below:

        elif isinstance(pred, s.Selector):
            for left_value, right_value in s.expand_overlap(pred, left, right):
                # dereference the left value to one of the relations in the join chain
                left_value, right_value = dereference_sides(
                    left_value.op(), right_value.op(), deref_left, deref_right
                )
                yield comparison(left_value, right_value)
coroa commented 9 months ago

@cpcloud If i get a pointer to where tests should be situated, i'll set some up!

kszucs commented 8 months ago

Thanks @coroa for submitting the issue and working on this!

I would like to suggest you, that retargeting the feature against the-epic-split branch might be better, since it has a more robust join functionality. In addition to that, it may already support selectors because of using more robust bind() function.

Could you please put up a PR targeting the-epic-split base branch with one ore more test cases added to https://github.com/ibis-project/ibis/blob/the-epic-split/ibis/expr/tests/test_newrels.py? If it is not working already we can adjust the implementation based on your previous PRs.

coroa commented 8 months ago

Hi @kszucs ,

indeed, bind will expand the selectors separately at https://github.com/ibis-project/ibis/blob/e9aa348cb18c71c46be51bf66037f9f523335a88/ibis/expr/types/relations.py#L118C1-L119C39 , but fail to solve the example above.

The yield from of the expanded columns for the case ~s.c("value") will expand to the model and scenario values, but in prepare_predicates (the section you linked), there is only a single left_value and right_value allowed.

Nevertheless showing that in a new test case seems like a sensible way forward. I'll set one up!