substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.16k stars 150 forks source link

[DISCUSS] Should we allow options to be declared at the top level of a function definition #673

Open jacques-n opened 1 month ago

jacques-n commented 1 month ago

The rounding blocks in this file: https://github.com/substrait-io/substrait/blob/main/extensions/functions_rounding.yaml

Make my eyes hurt and destroy my DRY soul. How about we add support for options to be declared at the same level as function name so these patterns aren't this awful...

Thoughts @EpsilonPrime @vbarua @westonpace ?

jacques-n commented 1 month ago

The argument descriptions feel similarly painful...

westonpace commented 1 month ago

There are some arithmetic functions where the options do not apply to all implementations (fp addition does not specify overflow behavior for example).

I don't know if there are any implementations where the arguments differ based on the implementation.

What if argument / option descriptions were at the function level but each implementation still listed which options / arguments apply?

    name: "round"
    description: >
      Rounding the value `x` to `s` decimal places.
    args:
          - name: "x"
            description: >
              Numerical expression to be rounded.
          - name: "s"
            description: >
              Number of decimal places to be rounded to.

              When `s` is a positive number, nothing will happen
              since `x` is an integer value.

              When `s` is a negative number, the rounding is
              performed to the nearest multiple of `10^(-s)`.    
    options:
      rounding:
        description: >
          When a boundary is computed to lie somewhere between two values,
          and this value cannot be exactly represented, this specifies how
          to round it.

            - TIE_TO_EVEN: round to nearest value; if exactly halfway, tie
              to the even option.
            - TIE_AWAY_FROM_ZERO: round to nearest value; if exactly
              halfway, tie away from zero.
            - TRUNCATE: always round toward zero.
            - CEILING: always round toward positive infinity.
            - FLOOR: always round toward negative infinity.
            - AWAY_FROM_ZERO: round negative values with FLOOR rule, round positive values with CEILING rule
            - TIE_DOWN: round ties with FLOOR rule
            - TIE_UP: round ties with CEILING rule
            - TIE_TOWARDS_ZERO: round ties with TRUNCATE rule
            - TIE_TO_ODD: round to nearest value; if exactly halfway, tie
              to the odd option.
    impls:
      - args:
          - value: i8
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i8?
      - args:
          - value: i16
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i16?
      - args:
          - value: i32
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i32?
      - args:
          - value: i64
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i64?
      - args:
          - value: fp32
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: fp32?
      - args:
          - value: fp64
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: fp64?

We could also leave the option's "values" at the function level (so each function just has an array of option names)

The only programmatic tool I know of that would be impacted is the BFT and I think it already reports options / arguments at the function level so there would be some benefit to align the two models.

richtia commented 1 month ago

I'm a fan of argument/option descriptions at the function level. We have a fun of other functions where all this information is put into the descriptions that's already at the function level, and it seems like it could be organized better.

Some of the string functions for example: https://github.com/substrait-io/substrait/blob/main/extensions/functions_string.yaml#L61

Blizzara commented 1 month ago

For the sake of discussion, an alternative would be to allow multiple types in the args' value field, or generic types. Though that might require allowing to specify output type in terms of an input type. I.e. something like:

name: "round"
    description: >
      Rounding the value `x` to `s` decimal places.
    impls:
      - args:
          - value: integer # alternative 1: specify a "logical" type
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: >
              Number of decimal places to be rounded to.

              When `s` is a positive number, nothing will happen
              since `x` is an integer value.

              When `s` is a negative number, the rounding is
              performed to the nearest multiple of `10^(-s)`.
        options:
          rounding:
            description: >
              When a boundary is computed to lie somewhere between two values,
              and this value cannot be exactly represented, this specifies how
              to round it.

                - TIE_TO_EVEN: round to nearest value; if exactly halfway, tie
                  to the even option.
                - TIE_AWAY_FROM_ZERO: round to nearest value; if exactly
                  halfway, tie away from zero.
                - TRUNCATE: always round toward zero.
                - CEILING: always round toward positive infinity.
                - FLOOR: always round toward negative infinity.
                - AWAY_FROM_ZERO: round negative values with FLOOR rule, round positive values with CEILING rule
                - TIE_DOWN: round ties with FLOOR rule
                - TIE_UP: round ties with CEILING rule
                - TIE_TOWARDS_ZERO: round ties with TRUNCATE rule
                - TIE_TO_ODD: round to nearest value; if exactly halfway, tie
                  to the odd option.
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: arg_typeof("x")?
      - args:
          - value: # alternative 2: specify a list of types
             - fp32
             - fp64 
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: >
              Number of decimal places to be rounded to.

              When `s` is a positive number, the rounding
              is performed to a `s` number of decimal places.

              When `s` is a negative number, the rounding is
              performed to the left side of the decimal point
              as specified by `s`.
        options:
          rounding:
            description: >
              When a boundary is computed to lie somewhere between two values,
              and this value cannot be exactly represented, this specifies how
              to round it.

                - TIE_TO_EVEN: round to nearest value; if exactly halfway, tie
                  to the even option.
                - TIE_AWAY_FROM_ZERO: round to nearest value; if exactly
                  halfway, tie away from zero.
                - TRUNCATE: always round toward zero.
                - CEILING: always round toward positive infinity.
                - FLOOR: always round toward negative infinity.
                - AWAY_FROM_ZERO: round negative values with FLOOR rule, round positive values with CEILING rule
                - TIE_DOWN: round ties with FLOOR rule
                - TIE_UP: round ties with CEILING rule
                - TIE_TOWARDS_ZERO: round ties with TRUNCATE rule
                - TIE_TO_ODD: round to nearest value; if exactly halfway, tie
                  to the odd option.
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: arg_typeof("x")?

If multiple args specify lists of types, then all combinations of those would be considered valid. This could be useful especially if the function takes two or more arguments and they don't necessarily need to be of the same type (though maybe then the output type gets complicated). The "logical" types could be e.g. integers (i8 .. i64), numerics (i8 .. i64, fp32, fp64, decimal?), strings (string, varchar, fixedchar).