substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.19k stars 155 forks source link

Value vs "expression" (lambda-ish) function arguments #320

Closed jvanstraten closed 1 year ago

jvanstraten commented 2 years ago

This topic has come up in various forms a few times now, and in retrospect has IMO been overlooked by the specification entirely thus far, despite functions being defined for either binding method.

When a value argument is bound to a function, there is a distinction between binding the expression itself (which I'll call an expression argument) and binding the data value returned by the expression (which I'll call a value argument). The difference here is that, for expression arguments, the function can determine whether it wants to evaluate the expression (or it may even choose to evaluate it multiple times), while it can't for value arguments.

These are two fundamentally different things when expressions can have side effects (like random(), for instance) or if we want to care about short-circuiting behavior of functions like and(a, b). Description of short-circuiting behavior is already used all over the specification (for example for if-then-else expressions), so I have to assume that we care about this. If we don't consider this in scope, i.e. we require that expressions never have side effects and leave it up to the consumer to determine whether it short-circuits (which is just an optimization, then), we should remove references to short-circuiting behavior or clarify that they are only recommendations for performance.

The difference is especially important for an optimizer, because things like common-subexpression elimination only preserve behavior for value arguments and for expressions known to not have side effects. On that note, we're also not currently specifying in the YAMLs whether a function has side effects to begin with.

If it helps to conceptualize this, the representation of an expression argument in a typical programming language would be a lambda function with no arguments. Later on, we could extend this concept to passing lambda expressions with arguments, which, I suppose, would lead to a new expression type to refer to the argument values set by the function (similar to FieldRef, or just an extension thereof using a different root type). A feature like that is fundamentally necessary to represent functions like sort_list(LIST<T>, |T, T| -> boolean) -> LIST<T> in a generic way, to let you write for example sort_list(x, less_than(ArgRef(0), ArgRef(1))), but also more complex things like sorting based on a particular field in a list of structs (sort_list(x, less_than(FieldRef(root=ArgRef(0), 1), FieldRef(root=ArgRef(1), 1))) to compare based on the second field of the struct, for example).

jacques-n commented 2 years ago

Operators in Substrait are functional set operations. Each operator take in N sets of data and output one, so there can't be side effects (at least how I would generally use the term). Functions like sort_list are also logically functional. Take in one list and produces another list.

Random doesn't have side effects, it is just non-deterministic.

The one exception (no pun intended) that I see as something that one could consider a side effect is exceptions. If I do '5'::int = 4 AND 'notanumber'::int=6, reversing the expressions to have 'notanumber'::int=6 AND '5'::int = 4 will result in a different behavior if we say AND can short-circuit. From my point of view, this is why we need to express the short-circuit behavior of AND. Substrait should be clear about this behavior (and predictable). An optimizer may decide to ignore this constraint since there are lots of benefits to eliding these operations. In Snowflake, for example, this distinction can be made using views versus secure views. I think we should express that the plan expectation is '5'::int = 4 AND 'notanumber'::int=6 is not the same as 'notanumber'::int=6 AND '5'::int = 4. More concretely, I see options:

jvanstraten commented 2 years ago

The one exception (no pun intended) that I see as something that one could consider a side effect is exceptions.

Ah, good point. I hadn't considered exceptions.

Random doesn't have side effects, it is just non-deterministic.

Depends on whether it's seeded or not. But I'm fine with requiring that functions don't have side effects aside from exceptions, in which case seeded random is immediately off the table.

Either way, the point I'm trying to make is that there is currently no way for a machine to know from just looking at a plan and associated YAML files whether something of for example the form a(b(field_ref)) can be converted to a project relation that does b(field_ref) followed by a later invocation of a(...) on the field added by the projection. If a doesn't always evaluate its argument, that changes the behavior, but for the vast majority of functions it does not. Currently this information is and can only be written in human-readable text in the description of a function. It seems like low-hanging fruit to me to just split the value argument type in two, especially since it only requires addition of a field in the YAML file and some docs to do so.

Functions like sort_list are also logically functional. Take in one list and produces another list.

You're glossing over the reason why I used sort there though, namely that sorts tend to be passed callbacks to pass a comparator function. We've largely been skirting around that issue because for aggregate function bindings sorts are special-cased, but that workaround isn't available for a scalar function operating on the list type. Though, FWIW, I think there's a lot of generalization work that can be done here, for example allowing aggregate and window functions or even complete subqueries to be run in scalar context based on nested list types.

julianhyde commented 2 years ago

A few DBMSs have what they call 'lambda support' or 'higher-order functions', including operations such as sort_list(LIST<T>, |T, T| -> boolean), but they only support lambdas that are constant, that is, specified inside the argument list.

True lambda support would allow function-values (lambdas and closures) that can be stored and processed in the same way as other values.

The examples of higher-order that I have seen - filter, map, sort - can all be implemented using a sub-query, albeit a little less concisely. I suggest that substrait doesn't yield to this fad of putting a functional-programming-like veneer on relational algebra, and instead waits for real first-class-functions to arrive.

jvanstraten commented 2 years ago

True lambda support would allow function-values (lambdas and closures) that can be stored and processed in the same way as other values.

That's a good point; that would be taking it one step further. In that case, I suppose you would also need an expression type that constructs a lambda out of another expression and an argument type pack, and a data type that might look something like LAMBDA<STRUCT<arg0_type, arg1_type, ...>, return_type>. I don't think doing it the "constant" way first would break a generalized solution later, though.

julianhyde commented 2 years ago

Yes, a function plus a (possibly empty) struct of argument values is a good way to implement lambdas and other closures.

The danger of doing it the "constant" way is that you bake in too many assumptions based on the constant case. For example:

If you don't steer clear of those assumptions you might have to break compatibility when true first-class functions arrive.

jvanstraten commented 2 years ago

Substrait already has several "argument types," only one of which actually binds to values that exists in the normal type system, so lambda arguments could just be added to that list without making things any less consistent than they already are. They would simply bind to an expression tree with a new argument reference expression type as the leaves, of which the return type is just the return type of the lambda (rather than the lambda itself). If we then later decide to add first-class lambdas, we'd just add a lambda literal that constructs a lambda out of the context of one of those special argument slots, at which point we can deprecate the special lambda function argument type (or choose to just support both).

It's certainly cleaner to do it right the first time, but that doesn't necessarily mean it's also better, because putting lambdas in the normal data type system that's also used for database schemas and the likes might be a bit alarming to the typical Substrait user because they're not really data to begin with. I don't think it's necessarily as complex to add to the spec as I originally thought, though.

westonpace commented 1 year ago

This issue appears to be covering two topics. First, clarifying our definition of short circuiting and, second, adding support for lambdas (expressions that are, themselves, arguments).

I think we should leave this ticket open for lambdas (as the title implies). We can clarify short-circuiting in a separate ticket should someone choose.

westonpace commented 1 year ago

Also, on the topic of lambdas, I am +1 on adding these and I agree that (if I am reading the above correctly) making a "lambda type class" is a good way to tackle this.

westonpace commented 1 year ago

I am closing this in favor of #349 as I think we probably only need one issue open for lambdas. However, both that issue and this issue have useful info.