substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0

feat: add i32 return type for length functions #606

Open richtia opened 6 months ago

richtia commented 6 months ago

This PR adds an i32 return type for a few length functions based on existing engine support.

Postgres, DataFusion, and Acero all have i32 return types for these functions; DuckDB has i64.

westonpace commented 6 months ago

Does having the same set of input arguments but different output arguments cause an issue?

David is right. We cannot overload functions based only on return type:

Every compound function signature must be unique. If two function implementations in a YAML file would generate the same compound function signature, then the YAML file is invalid and behavior is undefined.
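The rule quoted above can be sketched concretely. In Substrait, a compound function signature is the simple name joined with the argument type abbreviations (e.g. `add:i8_i8`); the return type is not part of the key, so two implementations that differ only in return type collide. A minimal illustration (the helper function is hypothetical, only the naming scheme is from the spec):

```python
def compound_name(name, arg_types):
    # Substrait compound names join the simple name with the argument
    # type abbreviations; the return type is NOT part of the key.
    return f"{name}:{'_'.join(arg_types)}"

# Two hypothetical implementations that differ only in return type
# produce the same compound signature, which makes the YAML invalid.
impl_returning_i64 = compound_name("char_length", ["str"])
impl_returning_i32 = compound_name("char_length", ["str"])
assert impl_returning_i64 == impl_returning_i32  # both "char_length:str"
```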

richtia commented 6 months ago

I'm afraid you will need to create a new function. Perhaps char_length_32? I'm not sure we've had to tackle this yet so we don't have a convention.

That works for me...but I also guess, to @EpsilonPrime's point, I'm wondering if it's too much of a problem to just have the producer/consumer handle any type of casting to i64 if needed in the situation we have today.

westonpace commented 6 months ago

That works for me...but I also guess, to @EpsilonPrime's point, I'm wondering if it's too much of a problem to just have the producer/consumer handle any type of casting to i64 if needed in the situation we have today.

I'm good with only i64. It's more work on the consumer either way. Either they have to support two functions (one of them probably with a cast) or they need to recognize and add an appropriate cast so they can consume a non-matching function.

On the producer end of things I think, in most situations, the user doesn't care about this detail. If they do then they can insert a cast. A savvy consumer will hopefully be able to optimize the resulting internal plan of consumer_char_length(...) -> cast(i64) -> cast(i32) into my_char_length(...).
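The cast-collapsing optimization described above can be sketched as a small peephole rewrite on an expression tree. Everything here is illustrative (the tuple encoding, the registry, and the function names are not from Substrait); it only shows how `f(...) -> cast(i64) -> cast(i32)` can fold into a native i32 kernel:

```python
# Hypothetical engine registry: native return type of each kernel.
NATIVE_RETURNS = {"char_length": "i32"}

def simplify(expr):
    # expr is either ("cast", target_type, inner) or ("call", name, args)
    if expr[0] == "cast":
        target, inner = expr[1], simplify(expr[2])
        if inner[0] == "cast":
            # Nested casts: only the outermost target type matters here.
            return simplify(("cast", target, inner[2]))
        if inner[0] == "call" and NATIVE_RETURNS.get(inner[1]) == target:
            # The native kernel already yields the target type: drop the cast.
            return inner
        return ("cast", target, inner)
    return expr

plan = ("cast", "i32", ("cast", "i64", ("call", "char_length", ["col0"])))
print(simplify(plan))  # ("call", "char_length", ["col0"])
```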

westonpace commented 6 months ago

If we go the i64-only route, it might be nice to document this somewhere. It would be nice if there were a README.md for https://github.com/substrait-io/substrait/tree/main/extensions that defined:

Blizzara commented 1 month ago

I was thinking about these return types recently, given that Spark and DataFusion disagree on the return types in many cases. I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed?

Given that:

1) we cannot (at least currently) have functions with the same name and same args but different return types, as the return type is not part of the compound key. Having separately named functions for separate return types feels just wrong.

2) a Substrait producer usually goes from internal plan -> Substrait, i.e., one direction only. It is not able to modify the plan to fit the Substrait spec, so if its internal return type for a function doesn't match what Substrait expects, there is nothing it can do (I think?).

3) Function calls in proto definitions already contain output types, like here. It would seem reasonable to me that the producer picks the function impl with the right args, but then writes the output type it has - and consumer would be responsible for adhering to that output type by casting if its implementation differs.

I don't know about the other consumers, but DataFusion doesn't use the extension's return type (well, doesn't use the extensions at all 😅 ) and the Substrait java lib has the validation as no-op as it's too hard to do properly.

westonpace commented 1 month ago

we cannot (at least currently) have functions with same name and same args but different return types, as the return type is not part of the compound key. Having separately named functions for separate return types feels just wrong.

Isn't this how most languages work? Even those that allow overloading do not allow overloading based purely on return type.

a Substrait producer usually goes from internal plan -> Substrait, i.e., one direction only. It is not able to modify the plan to fit the Substrait spec, so if its internal return type for a function doesn't match what Substrait expects, there is nothing it can do (I think?).

This is true even if Substrait doesn't exist. If a producer and a consumer do not agree on how a function should behave then there is a problem.

Function calls in proto definitions already contain output types, like here. It would seem reasonable to me that the producer picks the function impl with the right args, but then writes the output type it has - and consumer would be responsible for adhering to that output type by casting if its implementation differs.

Yes, a consumer is free to act this way today. The return type in the YAML does not prevent this behavior. I assume most consumers will never consult the YAML. They will look at the fully qualified function name, the input types, and the output type, and then figure out if they have a kernel (usually a lookup in some kind of registry) that can satisfy the request.
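The lookup described above can be sketched as a registry keyed on function name plus input types, with a cast inserted when the plan's declared output type differs from the kernel's. All names and the tuple encoding are hypothetical; only the resolution strategy is from the comment:

```python
# Hypothetical consumer kernel registry: (name, input types) -> return type.
REGISTRY = {("char_length", ("str",)): "i64"}

def resolve(name, input_types, declared_output):
    kernel_output = REGISTRY.get((name, tuple(input_types)))
    if kernel_output is None:
        raise LookupError(f"no kernel for {name}({', '.join(input_types)})")
    if kernel_output == declared_output:
        return ("call", name)
    # A kernel exists but with a different return type: adapt with a cast.
    return ("cast", declared_output, ("call", name))

print(resolve("char_length", ["str"], "i32"))
# → ("cast", "i32", ("call", "char_length"))
```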

I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed?

I think angst around return type "width" (e.g. i32 vs i64 vs u32 vs u64) is a symptom of a slightly different problem. In a perfect world there would only be one integer type (integer) and the width of that integer would be an implementation detail (I think bkietz sold me on this idea long ago). That's part of the reason I think i64 always is a perfectly fine approach.

I think there is still value in having the return type in the YAML. I think the main value is that the return type tells me the returned value is an integer and not a float / string / timestamp / etc.

Blizzara commented 1 month ago

we cannot (at least currently) have functions with same name and same args but different return types, as the return type is not part of the compound key. Having separately named functions for separate return types feels just wrong.

Isn't this how most languages work? Even those that allow overloading do not allow overloading based purely on return type.

I think there's a difference in how we view Substrait's function (and maybe my view is incorrect 😅 ). Am I correct in understanding you see Substrait proto (and simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition for that query engine written in that language? (Or at least something like this.)

Difference being I've seen Substrait rather as a language for defining how a query engine (or a plan, really) does work, with the YAML files defining the different options.

Put in other words, the two options would be:

  1. Substrait (YAML) files define a behavior for query plan. Engines should behave like Substrait defines. vs
  2. Engines have existing behavior. Substrait should be able to represent that behavior.

I think both approaches would be valid goals. I would expect it to take a long time until engines like Spark move towards compatibility with Substrait, so I see (2) as the more useful approach. However, the good thing is that with the simple and advanced extensions, (2) is quite doable for users (e.g. me) even if the project's main goal is (1), so that's cool.

a Substrait producer usually goes from internal plan -> Substrait, i.e., one direction only. It is not able to modify the plan to fit the Substrait spec, so if its internal return type for a function doesn't match what Substrait expects, there is nothing it can do (I think?).

This is true even if Substrait doesn't exist. If a producer and a consumer do not agree on how a function should behave then there is a problem.

That's a different question. Even if the producer and consumer agree on the output type, if the Substrait YAML has a different output type, then there's really nothing the producer can do to produce plans that are valid according to that YAML (note: the producer can simply not use the Substrait YAML and ship its own, which solves this problem).

I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed?

I think angst around return type "width" (e.g. i32 vs i64 vs u32 vs u64) is a symptom of a slightly different problem. In a perfect world there would only be one integer type (integer) and the width of that integer would be an implementation detail (I think bkietz sold me on this idea long ago). That's part of the reason I think i64 always is a perfectly fine approach.

I think there is still value in having the return type in the YAML. I think the main value is that the return type tells me the returned value is an integer and not a float / string / timestamp / etc.

There are differences apart from int width, for example Spark's floor/ceil return ints while Substrait specifies floats. But yea, would you agree the right answer here is to create new YAMLs for Spark, rather than even trying to fit it into the YAMLs in the Substrait repo?

westonpace commented 1 month ago

Am I correct in understanding you see Substrait proto (and simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition for that query engine written in that language? (Or at least something like this.)

I'm not entirely sure I follow. I view the proto files / substrait as defining operator behavior (in the world of relational algebra) and the YAML as defining function behavior (I don't know what this world is called :laughing:).

But yea, would you agree the right answer here is to create new YAMLs for Spark, rather than even trying to fit it into the YAMLs in the Substrait repo?

Yes, I think creating Spark-specific YAMLs would be a perfectly fine approach.

There are differences apart from int width, for example Spark's floor/ceil return ints while Substrait specifies floats.

Yes, these differences seem like they could cause trouble. For example, imagine the plan is SELECT (CEIL(0.3) / 2). If some engines return 0.5 and other engines return 0 then I think that is a bad thing. Spark should either reject the plan (I don't have a function ceil that behaves as you want) or convert substrait ceil into spark ceil + cast.
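The `SELECT (CEIL(0.3) / 2)` example can be worked through directly: an engine whose `ceil` returns a float divides `1.0` by `2` and gets `0.5`, while an engine whose `ceil` returns an integer and applies integer division gets `0`. A quick check of the arithmetic:

```python
import math

# Float-returning ceil (as the Substrait YAML specifies): 1.0 / 2 -> 0.5
float_result = float(math.ceil(0.3)) / 2

# Integer-returning ceil with integer division (Spark-style): 1 // 2 -> 0
int_result = math.ceil(0.3) // 2

print(float_result, int_result)  # 0.5 0
```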

Blizzara commented 1 month ago

Am I correct in understanding you see Substrait proto (and simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition for that query engine written in that language? (Or at least something like this.)

I'm not entirely sure I follow. I view the proto files / substrait as defining operator behavior (in the world of relational algebra) and the YAML as defining function behavior (I don't know what this world is called 😆).

I think the question is do the YAMLs aim to define one target function behavior or do they aim to define all existing behaviors. With the former, it absolutely makes sense to have the return type (and to have only a single return type).

jacques-n commented 3 weeks ago

I think the question is do the YAMLs aim to define one target function behavior or do they aim to define all existing behaviors. With the former, it absolutely makes sense to have the return type (and to have only a single return type).

The way you asked this is unclear to me.

Each leaf argument/name combination in a YAML is trying to define the exact semantics of a function (which may exist in anywhere from zero to many systems). If two different functions have different behavior, they should be distinguished in one of the following ways. Note that for some kinds of variation in behavior you can choose among multiple mechanisms; for others, you have limited choices.

| Difference | Via options (same YAML & variant) | Separate variants (same YAML) | Separate YAML |
| --- | --- | --- | --- |
| Different output value semantics for same argument types and output type derivation | Yes | Yes | Yes |
| Difference in input argument types or count | No | Yes | Yes |
| Different output type semantics for same name and input types | No | No | Yes |
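As an illustration of the first two rows, a simple-extension YAML might distinguish behaviors like this (the function name, option name, and values are hypothetical, not taken from the official extensions; the structure follows the Substrait simple-extensions format):

```yaml
scalar_functions:
  - name: "char_length"
    impls:
      # Row 1: different output *value* semantics, same types —
      # expressed as an option on a single variant.
      - args:
          - value: string
        options:
          length_unit:
            values: [ CODEPOINTS, BYTES ]
        return: i64
      # Row 2: different input argument types — a separate variant
      # in the same YAML (distinct compound signature).
      - args:
          - value: "varchar<L1>"
        return: i64
```

Row 3 (same name and input types but a different output type) cannot be expressed in either of these ways, which is exactly why it requires a separate YAML.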