ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.08k stars 585 forks source link

feat: consolidate ibis.case(), Value.case(), Value.cases(), Value.substitute() #7280

Open NickCrews opened 11 months ago

NickCrews commented 11 months ago

Is your feature request related to a problem?

All 4 functions are basically doing the same thing. It feels to me like we should be able to merge them all into once function with the same API. It is confusing to users (including me!) which they should use, and is extra burden keeping the docstrings etc all correct.

Also, a long-running grossness of .case() is that it returns a CaseBuilder object, which is undocumented. I don't think we want to add this CaseBuilder object to the public API, I think it would be totally doable and much simpler if we could keep it to a single function call.

Describe the solution you'd like

I propose:

  1. removing ibis.case() and Value.case()
  2. keeping Value.cases() and Value.substitute(), and possible adding ibis.cases() (but I'm +0 on this). All of these have an API of
def cases(value: Value, mapping: Mapping[Any, Any] | Iterable[tuple[Any, Any]], default: Any = None):
   ...

(not sure about default vs else_, and if mapping is the best name)

where the keys can be

The values can be

I'm trying to think if there is some ambiguity if we accept both Values and BooleanValues similar to the ambiguity with selection vs filtering in t[_.x.isnull()], but I can't think of it.

I want to support the usecase of chained comparing one column to another:

(t.x == t.y).ifelse("same", (t.x > t.y).ifelse("greater", "less"))

I think this would be done with

t.x.case({lambda x: x == t.y: "same", lambda x: x > t.y: "greater"}, default="less")

but this feels a little gross. Makes me think either a top level or a table base API would be better:

t.cases({_.x == _.y: "same", _.x > _.y: "greater"}, default="less") ibis.cases(t, {_.x == _.y: "same", _.x > _.y: "greater"}, default="less")

I would want to make this

What version of ibis are you running?

main

What backend(s) are you using, if any?

all

Code of Conduct

cpcloud commented 11 months ago

+1 to consolidating cases and substitute.

removing ibis.case() and Value.case()

I don't think we should remove these. They are extremely familiar for users coming from SQL. We can likely consolidate all the various APIs under the Case ops though.

Also, a long-running grossness of .case() is that it returns a CaseBuilder object

The builder pattern is probably going to become more used in the codebase, especially where there's state that needs to be tracked. For example, I'm working on a way to better support chaining joins that uses this pattern. We're also thinking about how this might help clean up the GroupedTable object.

Agree that we should document any of these that are user-facing.

jcrist commented 11 months ago

I actually kinda like the idea of replacing the case builders with cases operations instead.

Say we had a top-level ibis.cases function to replace the ibis.case() builder with the following implementation:

def cases(*branches: tuple[ir.BooleanValue, ir.Value], default=None):
    cases, results = zip(*branches)
    return ops.SearchedCase(cases, results, default).to_expr()

Compare the following examples:

t = ibis.table({"sym": "str", "left": "int", "right": "int"})

# Existing `ibis.case`
expr1 = t.mutate(
    result=(
        ibis.case()
        .when(_.sym == "+", _.left + _.right)
        .when(_.sym == "-", _.left - _.right)
        .when(_.sym == "*", _.left * _.right)
        .when(_.sym == "/", _.left / _.right)
        .else_(0)
        .end()
    )
)

# New `ibis.cases`
expr2 = t.mutate(
    result=ibis.cases(
        (_.sym == "+", _.left + _.right),
        (_.sym == "-", _.left - _.right),
        (_.sym == "*", _.left * _.right),
        (_.sym == "/", _.left / _.right),
        default=0,
    )
)

I personally like the ibis.cases example and implementation better:

IMO I think we should deprecate the builders in favor of ibis.cases and Value.cases.

cpcloud commented 11 months ago

cases is probably similar enough to case that coming from SQL wouldn't be a huge lift +1

NickCrews commented 11 months ago

What do you both think about my proposal of supporting both boolean conditions and equality values?

Passing single values already works with Value.cases:

t.x.cases([
   (t.y, "equal to y"),
   (t.z, "equal to z"),
])

vs the more precise/verbose

t.x.cases([
   (_ == t.y, "equal to y"),
   (_ == t.z, "equal to z"),
])

Should this feature get propagated, or should we deprecate and drop it?

Similarly, what about passing in lambdas for the predicates?

Similarly, what about supporting Mapping in addition to the exisiting Iterable[tuple]?

jcrist commented 11 months ago

Should this feature get propagated, or should we deprecate and drop it?

I think that makes sense for Value.cases (where there's a concrete thing to compare against) but not for ibis.cases which is intended to be used inside a larger expression like Table.mutate. I definitely don't think any api should special case on boolean expressions, where if it isn't a bool already it's compared against a thing - that sounds like madness.

Similarly, what about passing in lambdas for the predicates?

Seems fine to me, although given how good deferred expressions have gotten I'm tempted to avoid introducing lambda support into new apis. :shrug: could go either way here.

Similarly, what about supporting Mapping in addition to the exisiting Iterable[tuple]?

Also seems fine. Only argument I'd have to avoid it is it's a bit weird to type/document/support variadic *branches as well as mapping/iterable-of-tuples. I think if we supported a mapping we should maybe ditch the variadic case? Something like:

def cases(branches: Mapping[Any, Any] | Iterable[tuple[Any, Any]], *, default: Any = None):
    ...

Again, no strong thoughts :shrug:.

NickCrews commented 11 months ago

I think that makes sense for Value.cases (where there's a concrete thing to compare against) but not for ibis.cases

Ok thanks for pointing that out, I felt there was something assymmetrical there but I couldn't put my finger on it. Annoying can't-have-it-all here: I would really like both flavors to have the exact same API, but at the same time I also want it to be a drop-in replacement for .substitute(), which implies that we need to support the non-boolean keys...Not sure how to resolve this.

that sounds like madness

I had this vague intuition there was some madness here, but I haven't been able to come up with the actual case. do you have an example that you want to avoid?

lambdas

I think this can get added as a followup without breaking anything, so maybe we leave it out for now.

I think if we supported a mapping we should maybe ditch the variadic case?

Oh I didn't think of a variadic *branches API! I like that better. I would vote for just that, and not supporting mappings. I also just remembered this other issue I filed, where Mappings come with the problem that keys need to be hashable, so you can't use plenty of Ibis types such as Deferreds as keys.

contang0 commented 11 months ago

will .case continue to work for a while after .cases is implemented? from user perspective I would like to have a bit of time to migrate the code to .cases

jcrist commented 11 months ago

yes, any work done here will have a deprecation period.