Open coroa opened 9 months ago
Thanks for the issue!
I don't think we can imbue specific selectors with special meaning in certain contexts.
There are a number of downsides:
What about a new selector like s.overlaps(left, right)
that computes the column intersection? I realize it's not as convenient, but it fits in to the existing paradigm without the downsides of the specialized interpretation, the main downside being verbosity I guess.
s.overlaps
would not solve the example use case, would it?
Since i explicitly wanted to transport the meaning of ~s.c("value")
, ie. shared columns except "value".
The only thing that is not well defined at the moment is how do selectors behave in regard to multiple tables, ie. how to define (maybe keep it at 2, rather than n tables at present)
selector.expand(table1, table2)
.
For the sake of ease you could just define the join
-mode as intersect
generally, or you add a join
mode to a selector (as field, defaulting to intersect) with a .join_as("union")
overwrite method, which also takes ("left", "right", "union" or "intersect").
s.overlaps would not solve the example use case, would it?
Sure, alone it would not. I am suggesting that that is a composable piece that when combined with ~s.value(...)
gets you what you want.
The only thing that is not well defined at the moment is how do selectors behave in regard to multiple tables
The devil is in the details here, but I'm not sure the work is justified without some more use cases. To be clear, I think your use case is important, I'm just not sure whether generalizing all selectors to work on 2+ tables is the right approach. It feels a bit like the nuclear option.
For the sake of ease you could just define the join-mode as intersect generally,
Yes, this is possible, but I'm not sure the tradeoff is worth it. This still has the downside of the same selector behaving differently in different contexts, as opposed to having a specific selector that gets you closer to what you want (a shorthand for specifying certain join key patterns) and composes with the rest of the selectors.
or you add a join mode to a selector (as field, defaulting to intersect) with a .join_as("union") overwrite method, which also takes ("left", "right", "union" or "intersect").
This is an interesting idea that might be worth exploring!
I'll do a bit of digging around and see how dplyr handles selectors (if it does!) in joins.
s.overlaps would not solve the example use case, would it?
Sure, alone it would not. I am suggesting that that is a composable piece that when combined with
~s.value(...)
gets you what you want.
It's quite likely i did not get your example then.
What does s.overlaps(left, right)
mean? ie what is left
what is right
?
I reviewed the selectors
module now.
class Selector(Concrete):
@abc.abstractmethod
def expand(self, table: ir.Table) -> Sequence[ir.Value]:
...
What about:
class JoinSelector(Concrete):
def expand(self, left: ir.Table, right: ir.Table) -> Sequence[str]:
def names(tab: ir.Table):
return [c.get_name() for c in self.selector.expand(tab)]
mode = getattr(self, "mode", "intersect")
if mode == "left":
return names(left)
if mode == "right":
return names(right)
if mode == "intersect":
names_right = frozenset(names(right))
return [n for n in names(left) if n in names_right]
if mode == "union":
names_left = names(left)
return names_left + [n for n in names(right) if n not in names_left]
raise ValueError('mode must be one of {"intersect", "union", "left", "right"')
def join(selector: Selector, mode="intersect"):
return JoinSelector(selector=selector, mode=mode)
To allow
df1.inner_join(df2, s.join(~s.c("value")))
is that similar to what you had in mind with s.overlaps
.
The tricky bits, which i don't like yet, are: This is not a Selector
, since it does not follow the selector protocol, ie. the expand
signature is:
expand(self, left: ir.Table, right: ir.Table) -> Sequence[str]
rather than
expand(self, table: ir.Table) -> Sequence[ir.Value]
But then this is somewhat necessary.
Maybe we would want to stick with an ir.Value
based return-type, but then this would probably have to be one of Sequence[Tuple[ir.Value, ir.Value]]
(or Tuple[Sequence[ir.Value], Sequence[ir.Value]]
) to accomodate the different tables.
Are there other operations than join that interact with more than one table?
🤔 Very interesting.
Are there other operations than join that interact with more than one table?
The various set operations: t.union(s)
/t.intersect(s)
/t.difference(s)
. I'm not sure if there are specialized selectors that are necessary there though, for the operation to succeed at all the schemas of the inputs must match. Let's consider just join for the moment to keep the discussion focused.
Maybe we would want to stick with an ir.Value based return-type,
I think we should, if possible.
I don't think we need to worry about preserving the origin table in the structure of the return type, as that information is already available in the expression graph.
Ok, I quickly put up two similar draft implementations:
join
"selector" with the different modesexpand_overlap
to the selector module to expand the intersectionBoth implementations end up with ir.Value
tuples and use those in the _clean_join_predicates
function to generate a working join predicate.
I like #8028 due to its simplicity, but i also do not see a use-case for the non-intersection modes.
I had a quick look at compatibility with the-epic-split
on Saturday and i think the equivalent of hooking expand_overlap
(or JoinSelector.expand
) into _clean_join_predicates
there is to modify https://github.com/ibis-project/ibis/blob/e851796fee255b78afc57c69748422d75dae6af1/ibis/expr/types/joins.py#L97-L130
and include a case as below:
elif isinstance(pred, s.Selector):
for left_value, right_value in s.expand_overlap(pred, left, right):
# dereference the left value to one of the relations in the join chain
left_value, right_value = dereference_sides(
left_value.op(), right_value.op(), deref_left, deref_right
)
yield comparison(left_value, right_value)
@cpcloud If i get a pointer to where tests should be situated, i'll set some up!
Thanks @coroa for submitting the issue and working on this!
I would like to suggest you, that retargeting the feature against the-epic-split
branch might be better, since it has a more robust join functionality. In addition to that, it may already support selectors because of using more robust bind()
function.
Could you please put up a PR targeting the-epic-split
base branch with one ore more test cases added to https://github.com/ibis-project/ibis/blob/the-epic-split/ibis/expr/tests/test_newrels.py? If it is not working already we can adjust the implementation based on your previous PRs.
Hi @kszucs ,
indeed, bind
will expand
the selectors separately at https://github.com/ibis-project/ibis/blob/e9aa348cb18c71c46be51bf66037f9f523335a88/ibis/expr/types/relations.py#L118C1-L119C39 , but fail to solve the example above.
The yield from
of the expanded columns for the case ~s.c("value")
will expand to the model
and scenario
values, but in prepare_predicates
(the section you linked), there is only a single left_value
and right_value
allowed.
Nevertheless showing that in a new test case seems like a sensible way forward. I'll set one up!
Is your feature request related to a problem?
No
Describe the solution you'd like
I'd like to be able to use selectors to specify on which columns to join.
f.ex. for two tables
df1
(with columnsmodel
,scenario
,value
) anddf2
(with columnsmodel
,year
,value
), I'd like:to be interpreted the same as
ie. the selector should be interpreted as the intersection of the selected columns of both tables.
Refer also to discussion on zulip https://ibis-project.zulipchat.com/#narrow/stream/405265-tech-support/topic/.E2.9C.94.20selectors.20on.20joins .
What version of ibis are you running?
7.0.1
What backend(s) are you using, if any?
No response
Code of Conduct