ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.36k stars 540 forks source link

refactor(api): restrict arbitrary input nesting #8917

Closed kszucs closed 1 month ago

kszucs commented 1 month ago

Previously we allowed arbitrary nesting of input expressions for various API methods like table.join, table.group_by, etc. While this was backward compatible with 8.0.0 it also exposes additional ambiguity which can be confusing for users and difficult to reason about.

This change restricts the nesting of input expressions to a "single level" of nesting with the exception of the first positional argument which can be a list of expressions, but only if there are no more positional arguments. Examples:

t.select([t.foo, t.bar])  # OK
t.select(t.foo, t.bar)    # OK
t.select([t.foo], t.bar)  # Error
t.select(t.foo, name=t.bar)  # OK
t.select([t.foo], name=t.bar)  # OK

Aims to resolve https://github.com/ibis-project/ibis/issues/7819 and https://github.com/ibis-project/ibis/issues/8304, depends on https://github.com/ibis-project/ibis/pull/8884

kszucs commented 1 month ago

There are some ambiguities around the first arguments which can be a list of expressions:

table.select(["a", ["b"]])

# constructs

r0 := UnboundTable: table
  a int8
  b int16
  c int32
  d int64
  e float32
  f float64
  g string
  h boolean
  i timestamp
  j date
  k time

Project[r0]
  a:      r0.a
  ('b',): ['b']

a is interepreted as a column whereas ["b"] interpreted as an array literal.

kszucs commented 1 month ago

Apparently we also have a backend test case using a mapping as a single positional argument.

cpcloud commented 1 month ago

Why can't we keep

t.select([t.foo], t.bar)

?

It doesn't seem meaningfully different from allowing

t.select([t.foo], bar=t.bar)
kszucs commented 1 month ago

If we allow:

t.select([t.foo], t.bar)

then should we allow

t.select([t.foo], [t.bar])

as well?

cpcloud commented 1 month ago

I think so. It's really things like t.select([[...]], [[...]]) that should be disallowed.

kszucs commented 1 month ago

My preference would be to either provide args as variadics or as a single list-like argument. Even then we have confusion when to treat list-like inputs as literals vs value list.

jcrist commented 1 month ago

We talked today during triage and decided that at minimum we want to remove the support for arbitrarily nested expressions (as Phillip noted above).

I actually agree with the initial proposal at the top (only accept lists of column refs if select is a 1-arg function, otherwise treat it as variadic). Coincidentally this matches how polars select works, so there's some ecosystem-consistency there. The trick with doing this as a breaking change though is that previously working code may instead silently be treated as an array literal.

With this PR, what happens here? Does this error, or does it silently coerce these to array literals?

t.select(["a", "b"], ["c", "d"])  # is this two array literals or an error?

t.select("a", "b", ["c", "d"])   # is ["c", "d"] an array literal, or does this error?

Either way, I think we definitely want to forbid the case that phillip brought up before 9.0. I'm not opposed to the full changes here either, I think they make things more consistent, but we can always do that later IMO in 10.0.

kszucs commented 1 month ago
t.select(["a", "b"], ["c", "d"])  # is this two array literals or an error?

yes, two array literals, not erroring

t.select("a", "b", ["c", "d"])   # is ["c", "d"] an array literal, or does this error?

yes, an array literal, not erroring

kszucs commented 1 month ago

I would prefer deciding what should be the ideal long term behaviour then extend that to be backward compatible and either raise a deprecation warning or provide a config option to keep the previous behaviour.

cpcloud commented 1 month ago

Let's keep the full changeset here, and make sure to have both a BREAKING CHANGE bit as well as documentation on the behavior in the relevant APIs (select and mutate).

kszucs commented 1 month ago

Another thing I'm not sure about is whether we should do the dereferencing here or not? Like should t.bind(totally_different_table.col) raise?

cpcloud commented 1 month ago

Seems like it might keep bind more flexible and easier to maintain if we dereference elsewhere.

kszucs commented 1 month ago

Actually it could help code consolidation. I can try it after the dereferencer improvement gets merged:

cpcloud commented 1 month ago

We should probably kill destructure() now that proper alternatives (lift) are available.

cpcloud commented 1 month ago

I'm going to rip out destructure in a separate PR.

kszucs commented 1 month ago

I'm going to rip out destructure in a separate PR.

Updated the test case to expect an input error rather than an integrity error. We should update unwrap_aliases() to raise ibis input error instead of integrity error too.

kszucs commented 1 month ago

One confusing case:

In [4]: table = ibis.table({"a": "int", "b": "string"})

In [5]: table.select(["a", ["b"]])
Out[5]:
r0 := UnboundTable: unbound_table_1
  a int64
  b string

Project[r0]
  a:      r0.a
  ('b',): ['b']
cpcloud commented 1 month ago

@NickCrews Can you review this again? This is a remaining blocker for the 9.0 release which we'd like to get out early next week.

NickCrews commented 1 month ago

I just took a long look.

Most things look great, a few nits, but the one substantial thing is the t.bind([1,"a"]) behavior. I have an inline comment above. Can someone fill me in on the context there a little more? I don't think this is a blocker for me.

cpcloud commented 1 month ago

@NickCrews Thanks for doing a deep dive on tests. There were indeed a few cases that weren't tested and also the lambda _: 2 case wasn't yet implemented. Appreciate the feedback!

NickCrews commented 1 month ago

Cool this looks great then!

cpcloud commented 1 month ago

@NickCrews Can you approve the PR?

NickCrews commented 1 month ago

Sorry I was on the mobile app where the approval status isn't even shown so i didn't know there was anything to do