substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.11k stars 147 forks source link

What is the level of targeted portability of Substrait? #596

Open ingomueller-net opened 5 months ago

ingomueller-net commented 5 months ago

I am trying to understand how rigorous Substrait aims to be with its specification and, hence, how portable it will eventually be. More concretely, do we expect exactly the same results independently of the backend that this runs on?

Note that in the SQL standard and most SQL dialects, this isn't the case at least with respect to the order of the results (including any unsorted intermediate result) even for different executions in the same system. If not, how exactly do we define the semantics and hence the set of "correct" results of a plan and how do we test implementations for compliance? On the issue of result order, the Substrait specification could leave the order undefined, like in the SQL standard, and one could probably come up with some acceptance tests that allow any valid execution.

However, there are other issues, such as the exact behavior of DECIMALs, which may result in different results unrelated to the order. (I just commented on ibis-project/ibis#8195, see details there.) If Substrait's specification said "the decimal operations do whatever the consumer does with decimals", then changing consumers might obviously result in a different query result. If it specified instead which exact bits need to be produced, it could be that some consumers need to emulate the specified behavior with more and/or more costly operations, which has its own downsides. What is Substrait's take here?

westonpace commented 5 months ago

Let me tackle this in a few parts:

Substrait does have some mandates regarding record order

Certain relations can specify "orderedness": https://substrait.io/relations/basics/#orderedness which is defined as:

A guarantee that data output from this operation is provided with a sort order. The sort order will be declared based on a set of sort field definitions based on the emitted output of this operation.

That being said, there are a few holes. For example, the read operator is not able to specify its ordering (often the input data can have a known ordering if it is written to disk in some sorted order).

For the most part, it's up to an engine on how to actually implement this. For example, a fetch (e.g. limit) relation, that receives data with a defined ordering, should return the top X results according to that ordering. Maybe this means the engine switches to single threaded execution. Maybe the engine runs in parallel with some kind of "batch index". Substrait does not care.

Another thing that Substrait does not mandate is how data is output from a query engine. For example, is it returned in memory in Arrow format? Is it written to some temporary file? So there is potentially some wiggle room for an engine to respect the ordering during plan execution and then deliver the results out of order but I would say this is bad form. I would think that, if the terminal / root relation has a well defined ordering, then a well behaved consumer should return data respecting that ordering.

The default comparison behavior for type classes is undefined but shouldn't be

Currently, Substrait does not define exactly how to compare items. For example, where should NaNs be placed when ordering floats? (total ordering is quite nice). This has been a long-standing TODO and it is something that we expect Substrait will eventually declare. For example, there are spots to specify a custom comparison function.

Substrait defines "type classes" and not "encodings"

Substrait's type classes do not mandate how the data is actually represented in memory (or on disk). A timestamp could be represented by a single 64 bit value or it could be a pair of 64 bit values. The only thing that Substrait mandates for most type classes is precision and range (e.g. if a column has an INT8 type then it must not allow a value of 128 and should trigger an overflow even if the underlying engine stores all integers as 64-bit values)

When it comes to the decimal type I think the precision and range is pretty straightforward.

Functions are tricky. We want to be clear but...

Specifying the exact behavior of functions is a daunting task. For a simple example we can consider integer addition. When an integer overflow occurs should the engine return "the highest possible value", "an error", "null", or "wrap around". Different engines today will have different behaviors. Also, "add" is an easy one. There are other behaviors which are not well defined by Substrait. What is the exact answer for a floating point sum operation? Because of floating point rounding errors the final answer will depend on the algorithm used (e.g. kahan summation). Consider the date type. Substrait currently states that the allowed range is year 1000 to year 9999. However, functions involving dates do not state what happens when this range is exceeded. Also, every single consumer that I am aware of is capable of representing a larger range than the one mandated by Substrait. As a result, these consumers probably won't return an error at all and will simply return a value that is outside the legal range.

There has been some attempt to define these corner case behaviors using options (there is an OVERFLOW option for the integer add function.). If a producer expects a certain behavior then they should declare the options when they create the plan. If a consumer receives a plan, and its options do not line up with what the consumer can deliver, then the consumer should reject the plan. There are also most likely cases where options are not sophisticated enough and we will need multiple functions.

This is still an ongoing effort. No producer that I am aware of actually sets these options. Most consumers ignore them completely.

I think the most complete work here has been in the BFT which attempts to map various SQL (not necessarily just Substrait) engines to specific function options. One potential path forward, instead of requiring consumers to specify options, is to allow consumers to make "blank plans" (no options specified) and then users can "localize" the blank plan to whatever dialect they are used to (e.g. what behaviors they expect) using the information from the BFT. This will still require consumers to actually respect the options and reject plans accordingly. An ever longer term goal would be to translate plans from one dialect to another.

Conclusion

Boiling this all down I would say that the prime directive for Substrait is to be pragmatic about what's possible to specify today while providing a path towards clear and definite plans in the future. Given this, I would say that the current state of things is that different engines/consumers can respond to plans in very different (and sometimes invalid) ways.

ingomueller-net commented 5 months ago

Thanks for the prompt and elaborate answer! This was exactly was I was looking for!

Substrait does have some mandates regarding record order

Certain relations can specify "orderedness": https://substrait.io/relations/basics/#orderedness which is defined as:

Thanks for that pointer. I hadn't looked at the specification in quite a while but I think it is quite clear now: the order is undefined except where specified otherwise. For me, this makes it clear what the specification mandates.

That being said, there are a few holes. For example, the read operator is not able to specify its ordering (often the input data can have a known ordering if it is written to disk in some sorted order).

That would make sense. But there is something in the spec that I would have interpreted otherwise:

Output Properties | Declaration of orderedness and/or distribution properties this read produces. | Optional, defaults to no properties.

The default comparison behavior for type classes is undefined but shouldn't be

Currently, Substrait does not define exactly how to compare items. For example, where should NaNs be placed when ordering floats? (total ordering is quite nice). This has been a long-standing TODO and it is something that we expect Substrait will eventually declare. For example, there are spots to specify a custom comparison function.

Good point!

BTW, it may be worth specifying explicitly what string equality means (if not done so already). Collations in SQL may affect string ordering and even string equality in quite unexpected ways!

Substrait defines "type classes" and not "encodings"

Substrait's type classes do not mandate how the data is actually represented in memory (or on disk). A timestamp could be represented by a single 64 bit value or it could be a pair of 64 bit values. The only thing that Substrait mandates for most type classes is precision and range (e.g. if a column has an INT8 type then it must not allow a value of 128 and should trigger an overflow even if the underlying engine stores all integers as 64-bit values)

When it comes to the decimal type I think the precision and range is pretty straightforward.

Right, I think I can related that to the specification. So the semantics are pretty well-defined. Just to confirm: this means that a Substrait consumer that has a different definition of, say, DECIMALs would have to emulate the one mandated by Substrait, right?

Functions are tricky. We want to be clear but...

Specifying the exact behavior of functions is a daunting task. For a simple example we can consider integer addition. When an integer overflow occurs should the engine return "the highest possible value", "an error", "null", or "wrap around". Different engines today will have different behaviors.

Good point! BTW: this should carry over to aggregation with sum, right?

Also, "add" is an easy one. There are other behaviors which are not well defined by Substrait. What is the exact answer for a floating point sum operation? Because of floating point rounding errors the final answer will depend on the algorithm used (e.g. kahan summation).

I'd argue that this is related to orderedness. The question arises for any function that is either not associative or not commutative (or both); floating-point addition is only one of them. Take array_agg as another one: what is the expected output here? I'd argue that, if the input order is unspecified, any permutation of the resulting arrays is valid; if the input order is specified, the order of the elements in the resulting arrays must correspond to their input order.

Applied to floating-point addition and respecting the fact that it is not associative, this would mean that only one result is valid if the input order is specified, namely that of summing the inputs in their input order. However, for performance reasons, one could argue that we should pretend that the summation is associative, in which case any summation order would be fine.

Just for completeness: Even integer addition is non-associative due to overflows. Whether or not an overflow occurs may depend on the summation order...

I am wondering if associativity and commutativity aren't something that is missing in the definition of AggregationFunction?

Consider the date type. Substrait currently states that the allowed range is year 1000 to year 9999. However, functions involving dates do not state what happens when this range is exceeded. Also, every single consumer that I am aware of is capable of representing a larger range than the one mandated by Substrait. As a result, these consumers probably won't return an error at all and will simply return a value that is outside the legal range. [...] I think the most complete work here has been in the BFT which attempts to map various SQL (not necessarily just Substrait) engines to specific function options.

Wow, I did not know about BFT but that looks extremely valuable! This is what I'd expect/hope for for the specification of functions in Substrait. Unless I am overlooking something, today, it isn't even stated explicitly that, say, 10 years + 10 years evaluates to 20 years, is it?

One potential path forward, instead of requiring consumers to specify options, is to allow consumers to make "blank plans" (no options specified) and then users can "localize" the blank plan to whatever dialect they are used to (e.g. what behaviors they expect) using the information from the BFT. This will still require consumers to actually respect the options and reject plans accordingly. An ever longer term goal would be to translate plans from one dialect to another.

This makes complete sense to me! A Substrait producer would have to select options such that the set of possible answers corresponds to what they expect, most likely depending on the semantics of their input language. While there would be multiple possible options at various points in the plan and each might restrict the possible set of answers to more than one (i.e., there may still be multiple different "correct" answers due to indeterminism or implementation choice), it would still be (theoretically) possible to distinguish correct from incorrect answers.

Conclusion

Boiling this all down I would say that the prime directive for Substrait is to be pragmatic about what's possible to specify today while providing a path towards clear and definite plans in the future. Given this, I would say that the current state of things is that different engines/consumers can respond to plans in very different (and sometimes invalid) ways.

What I think could be helpful here is an official test suite. Given a database and a Substrait plan, what is/are the expected result(s). The outcome of each test would probably one of "correct", "incorrect", and "unsupported"; the latter case letting engines report that they don't support a particular option. Deciding between correct and incorrect may not be trivial in many cases, namely, if several answers are acceptable; however, one can probably get quite far with very small inputs where it is still possible to simply enumerate all correct results.

To circle back to my original question: would such a level of portability the desired end goal of Substrait?

westonpace commented 5 months ago

To circle back to my original question: would such a level of portability the desired end goal of Substrait?

I'll start with the last question. In my opinion, yes. Though Substrait is a community project, others are free to have differing opinions. If I had to guess I think that most others are simply indifferent on this issue. It is viewed as too lofty a goal but innocuous.

That would make sense. But there is something in the spec that I would have interpreted otherwise:

Ah! Good catch. Looks like that isn't a hole after all.

BTW, it may be worth specifying explicitly what string equality means (if not done so already). Collations in SQL may affect string ordering and even string equality in quite unexpected ways!

Yes, I think we want to explicitly define default equality and ordering for all of the type classes (though maybe string is the only interesting one when it comes to equality).

Just to confirm: this means that a Substrait consumer that has a different definition of, say, DECIMALs would have to emulate the one mandated by Substrait, right?

If the types behave semantically different (e.g. infinite precision) then yes. If it's just different storage (e.g. variable-length binary) then it's fine. Although I don't expect emulation to be likely anytime soon (it's a supply and demand issue and the demand for Substrait isn't high enough yet). For now, I think the goal is to catch this and flag it. Users should be forced to say something like "allow non-standard types" or "allow alternative decimals".

If anyone ever ends up adding a sqlite consumer this type of thing will be a big headache :laughing: Integer overflow seamlessly transitions results into floats. We'd need to invent some kind of "number" user defined data type.

I'd argue that this is related to orderedness. The question arises for any function that is either not associative or not commutative (or both); floating-point addition is only one of them.

Any aggregate or window function that is either not associative or commutative. Scalar functions should yield the exact same answer regardless of the order of the inputs (although there is some contention / gray area here which is why seeded random is not yet a scalar function).

Also, non-associative scalar functions can technically cause different results when plans get restructured (e.g. during an optimization pass) but I think that is a separate issue entirely :dizzy_face:

Take array_agg as another one: what is the expected output here?

Note that, aggregate functions whose output obviously depends on the order (e.g. array_agg and not functions like sum where the dependency is more implicit) are typically expressed with an ordering.

In postgres you would do SELECT array_agg(a ORDER BY b DESC) FROM table;. In Substrait the equivalent feature is https://github.com/substrait-io/substrait/blob/d9b9672fd3c24285afdee9344fc2f4f7fcd70afb/proto/substrait/algebra.proto#L1475-L1479

Nothing in Substrait would prevent you from defining an ordering for sum but I don't know if any consumers would respect that today. They might if they have a consistent implementation of the sorts field.

What I think could be helpful here is an official test suite. Given a database and a Substrait plan, what is/are the expected result(s). The outcome of each test would probably one of "correct", "incorrect", and "unsupported"; the latter case letting engines report that they don't support a particular option. Deciding between correct and incorrect may not be trivial in many cases, namely, if several answers are acceptable; however, one can probably get quite far with very small inputs where it is still possible to simply enumerate all correct results.

https://github.com/substrait-io/consumer-testing

Also, BFT has its own suite of tests (though focused on expressions, not plans, and not using Substrait)

ingomueller-net commented 2 months ago

Just to keep track of this somewhere, and because it may be of interest for someone finding this issue: I learned about two related resources: