substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.11k stars 147 forks source link

Function names & URIs #634

Open westonpace opened 2 months ago

westonpace commented 2 months ago

This is in relation to some discussion that came up in #631

The duplicated names prevent substrait-java from being updated.

However, there is a question around whether names need to unique across ALL extensions, or just within a single extension file. While > these functions were added in error (I think) you could make the case that:

functions_arithmetic/min:timestamp,max:timestamp
functions_datetime/min:timestamp,max:timestamp

should be treated as different functions.

My understanding is that a fully qualified function name is the triple:

<function uri>,<function name>,<function arguments>

Therefore, yes, these are different functions (because the function uri is different). However, there are (at least) three problems that have never really been fully resolved:

  1. No one uses the function URI

The major producers that I am aware of today (isthmus, duckdb, ibis) either set the function URI to undefined, the empty string, or / (I think we actually have all three behaviors across the three producers :face_exhaling: )

Correspondingly, consumers tend to ignore this field. The one exception I'm aware of is Acero which will tolerate /, empty string, and undefined (Acero goes into a "fallback" mode where it does name-only matching and will match any registered function with the same name regardless of the URI) but which will accept URLs of the form https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml and also has a special URI urn:arrow:substrait_simple_extension_function which means "use the arrow compute function with the given name" (this is how we support UDFs).

  1. What is the canonical URL?

There are several choices. For example:

# Stable URI but potentially unstable definition
https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml
# Stable definition but unstable URI
https://github.com/substrait-io/substrait/blob/v0.47.0/extensions/functions_arithmetic.yaml

My preference is the former, for practical reasons.

  1. How is versioning handled?

This is discussed in more detail in https://github.com/substrait-io/substrait/issues/274

But the basic question is "what if we have tons of users and we decide to make a change to some function?"

westonpace commented 2 months ago

I'm not sure I expect a solid answer but figured it would be good to solicit discussion and opinions.

joellubi commented 2 months ago

I'm currently facing some design choices related to this and this is roughly where my thinking is at this point:

TLDR at bottom...

Generally, uniquely qualifying function names is not a problem that consumers have to deal with because the extension references are already resolved in the plan they are receiving.

For plan producers, it seems some specific preferences do need to be adopted in order to produce plans under practical constraints. Specifically, the input most producers are serializing into substrait will typically contain the function name and input types, but not the URI. Generally speaking this is coming from SQL but dataframes have the same issue.

In the statement SELECT a + b FROM c the function name can be determined by explicit mapping (i.e. + -> add) and the input types can be determined once the column references are resolved to the base schema. Assuming the signature resolves to add:i64_i64 most implementations will currently resolve the URI to be https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml but the choice is arbitrary under the current spec, as far as I can tell.

In general I think this is reasonable. If a plan producer wants to shadow the "default" add:i64_i64 with their own implementation, they can declare a new extension and handle their own disambiguation logic. Where this gets a little tricky is with the "default" extensions defined under https://github.com/substrait-io/substrait/blob/main/extensions. If there are multiple implementations of add:i64_i64 under different URIs that are NOT user-defined then the producer's choice of which one to use could end up being arbitrary and lead to surprises across implementations.

I think making this arbitrary choice is awkward but in practice most consumers don't do this at all which is a source of bugs. The original issue (#631) is an example and it looks like ibis-substrait has a similar silent bug where the second occurrence of a (name, inputs) pair in the extensions yaml will shadow the first.

TLDR

Given all this, I think one way to help simplify implementations would be to ensure that all functions defined under https://github.com/substrait-io/substrait/blob/main/extensions have a unique name:signature, so that the URI could be inferred from that pair. If the producer implementation additionally defines its own extension files then it is up to that implementation to decide which URI to prefer for a given function invocation.

Thoughts on this?

westonpace commented 2 months ago

Generally, uniquely qualifying function names is not a problem that consumers have to deal with because the extension references are already resolved in the plan they are receiving.

Consumers generally need to map the fully qualified name to an actual function implementation though. Here is a simple python example showing the pattern I usually see:

###
# This part of the code base was written before substrait integration was considered
###
def plus(a, b):
  return a + b
def minus(a, b):
  return a - b
def l2_norm(x, y):
  return math.sqrt(x*x + y*y)
def evaluate_function(func, x, y):
  expr = f"func({x}, {y})"
  eval(expr)
###
# Later, initial substrait support is added, and a mapping needs to be created
###
function_mapping = {
  "add": "plus",
  "subtract": "minus",
  // No mapping for l2_norm since it doesn't exist in substrait
}
def evaluate_substrait_func(substrait_function, x, y):
  local_function_name = function_mapping[substrait_function]
  return evaluate_function(local_function_name, x, y)

Notice that URIs are not involved (because I don't commonly see it). This can lead to a problem. If a producer decides to shadow "add" and creates urn:my_custom_producer:add which means "return the string formed by converting each arg to a string and concatenating it" then they would expect the above consumer to reject the plan with "function urn:my_custom_producer:add is not defined" but instead it will just evaluate the standard add function.

Assuming the signature resolves to add:i64_i64 most implementations will currently resolve the URI to be https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml but the choice is arbitrary under the current spec, as far as I can tell.

I do not agree the choice is arbitrary. For example, let's decide we want to fix the above consumer to properly implement function mapping:

###
# A fixed example of substrait support
###
standard_function_mapping = {
  "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml": {
    "add": "plus",
    "subtract": "minus",
  },
  "https://some_other_site/substrait/linear_algebra.yaml": {
    "l2_norm": "l2_norm"
  }
}
def evaluate_substrait_func(sub_func_uri, sub_func_name, x, y):
  local_function_name = function_mapping[sub_func_uri][sub_func_name]
  return evaluate_function(local_function_name, x, y)

Now, if the producer provides urn:my_custom_producer:add then we will get an error, as we should, since the consumer cannot evaluate it.

However, if the producer arbitrarily decides on https://github.com/substrait-io/substrait/blob/v0.47.0/extensions/functions_arithmetic.yaml and they pass in add then the consumer should be able to satisfy it (it knows how to do the standard add) but will reject it because it doesn't recognize the URI the producer provided. The consumer and producer need to agree on what URI is used for any given function (including the standard functions). This means the choice of URI is not arbitrary.

If there are multiple implementations of add:i64_i64 under different URIs that are NOT user-defined then the producer's choice of which one to use could end up being arbitrary and lead to surprises across implementations.

How would you know if a URI is user-defined or not?

I think one way to help simplify implementations would be to ensure that all functions defined under https://github.com/substrait-io/substrait/blob/main/extensions have a unique name:signature, so that the URI could be inferred from that pair.

I partly agree. This is also called out here:

Every compound function signature must be unique. If two function implementations in a YAML file would generate the same compound function signature, then the YAML file is invalid and behavior is undefined.

I think it is technically legal for a function signature to be duplicated as long as the duplicate is in a different yaml file. This is because the yaml filename should be part of the URI. However, I would still discourage it, and am fine saying we don't want it, since it can be misleading.

joellubi commented 2 months ago

Thanks for the response @westonpace. I pretty much agree with all of this, and would like to clarify that I don't think there's actually any ambiguity in the spec as it is currently defined: a function is uniquely specified by (uri,name,signature).

In practice many implementations don't handle URIs correctly, but the implications are different on the producer vs the consumer side. Consider an implementation executing SQL queries via substrait:

SQL -> Substrait Producer -> Substrait Consumer -> Query Engine

I believe you were focusing on the right-hand side in your response, where a substrait message is mapped to a physical plan for the query engine to execute. The reason I say there isn't ambiguity on the consumer side is because the plan it receives should already have an explicit URI in the Plan.ExtensionUris. Ignoring the URI is a mistake. Some implementations make this mistake, as you pointed out, but the guidance on the correct behavior is pretty clear IMO.

On the left side of the workflow a SQL query is getting planned and then serialized into a substrait message. Consider the query SELECT 1 + 2. Again, there is technically no ambiguity here. The producer may decide to map this to extensions/functions_arithmetic.yaml:add:i64_i64 or to shadow that implementation with its own e.g. my_extensions/functions_custom.yaml:add:i64_i64.

This only gets confusing IMO if we were to (hypothetically) define another extension file in the main substrait repo called for instance extensions/functions_math.yaml which also contained add:i64_i64. This is valid under the spec but it doesn't make it obvious to producers whether SELECT 1 + 2 should be serialized with a reference to extensions/functions_arithmetic.yaml:add:i64_i64 or extensions/functions_math.yaml:add:i64_i64. Again its the producer's choice which one it prefers but if we could at least avoid requiring this custom logic when supporting the standard extension files, it could help simplify adoption.

vbarua commented 2 months ago

Correspondingly, consumers tend to ignore this field.

So substrait-java doesn't ignore the field, as I discovered during our version update internally in which plans using max:tstz in functions_arithmetic.yaml no longer worked, failing with:

java.lang.IllegalArgumentException: Unexpected aggregate function with key max:tstz. The namespace /functions_arithmetic.yaml is loaded but no aggregate function with this key was found.

Another example where URIs are very important. Substrait defines the avg function matching the semantics of the SQL spec in functions_arithmetic.yaml as:

  - name: "avg"
    impls:
      - args:
          - name: x
            value: i64
        options:
          overflow:
            values: [ SILENT, SATURATE, ERROR ]
        nullability: DECLARED_OUTPUT
        decomposable: MANY
        intermediate: "STRUCT<i64,i64>"
        return: i64?

However, we might want to have a variant of avg like

  - name: "avg"
    impls:
      - args:
          - name: x
            value: i64
        options:
          overflow:
            values: [ SILENT, SATURATE, ERROR ]
        nullability: DECLARED_OUTPUT
        decomposable: MANY
        intermediate: "STRUCT<i64,i64>"
        return: fp64?   # <-- RETURNS FLOATING POINT

in functions_postgres which matches Postgres semantics engine does. (For extra cursed-ness these two variants have the same signature avg:i64)

A consumer supporting Postgres style avg only should reject any plan containing functions_arithmetic/avg:i64. A producer targeting this consumer should generate plans using functions_postgres/avg:i64

More generally, a producer should only support a function if its implementation semantics match what is given in the function extension definition. A consumer should only use functions that match the semantics that they are trying to encode.

If there are multiple implementations of add:i64_i64 under different URIs that are NOT user-defined then the producer's choice of which one to use could end up being arbitrary and lead to surprises across implementations.

I actually agree that this is somewhat arbitrary in that a generic producer can chose any version that matches the semantics they are trying to encode. I would argue within the core substrait extensions we should avoid having functions that are identical except for URI

vbarua commented 2 months ago

@joellubi

I believe you were focusing on the right-hand side

I'm definitely rhs focused as well 😅

For your example starting from SQL, consider something like:

SELECT avg(<i64 column>) FROM ...

From the SQL side of things, this avg could be either the one in functions_arithmetic or (the hypothetical) functions_postgres. What the producer chooses will be depend on what behaviour it wants and what behaviour the consumers it's talking to have available. I don't think there's a generally correct choice.

joellubi commented 2 months ago

From the SQL side of things, this avg could be either the one in functions_arithmetic or (the hypothetical) functions_postgres. What the producer chooses will be depend on what behaviour it wants and what behaviour the consumers it's talking to have available. I don't think there's a generally correct choice.

This is a good point @vbarua, especially since functions_postgres would actually be very useful to have defined!

Currently some producers are auto-generating function definitions from the contents of https://github.com/substrait-io/substrait/blob/main/extensions. The function mapping for the SQL query above works fine as long avg:i64 is only defined once in those extension files. If we were to introduce functions_postgres then any implementation of the core substrait extensions would become non-trivial, and likely harder to auto-generate. The producer's preference of i64 or fp64 semantics would either need to be encoded somewhere in the application logic or taken as input from the end user. For example:

import fancy_client_that_uses_substrait as fc

res_i64 = fc.query("SELECT avg(order_quantity) FROM orders")
res_fp64 = fc.query("SELECT avg(order_quantity) FROM orders", dialect="postgres")

In this case the producer has the explicit information it needs to choose the correct extension file (if it assumes functions_arithmetic to be the default). I think the question this ultimately raises is whether the core extensions defined in the main substrait repo are prescriptive or not. Are they just example extensions that you may or may not use, or do we want to encourage producers/consumers to support all or as many of them as possible? Is auto-generating implementations from the core extension files something we want to make easier, or is it an anti-pattern?