substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.19k stars 155 forks source link

feat: add test file format #680

Open scgkiran opened 2 months ago

github-actions[bot] commented 2 months ago

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

EpsilonPrime commented 2 months ago

I'm guessing we're going to need to make sure that we have something that tests the validity of the test files. Would a parse test be sufficient enough for that purpose? Alternatively we could step it up and validate the tests against the function definitions ensuring that they refer to existing functions.

EpsilonPrime commented 2 months ago

It's a different use case. You're interested in testing whether a subset is compliant or not. BFT, until now, has been focused on showing all of the support (positive and negative) to provide a world view (the website). It can also be used as a way of showing total compliance which can be used to drive adoption.

On Wed, Aug 14, 2024, 21:56 Jacques Nadeau @.***> wrote:

@.**** commented on this pull request.

In tests/cases/arithmetic_decimal/power.test https://github.com/substrait-io/substrait/pull/680#discussion_r1717912566 :

@@ -0,0 +1,21 @@ +### SUBSTRAIT_SCALAR_TEST: 1.0 +### SUBSTRAIT_INCLUDE: extensions/functions_arithmetic_decimal.yaml + +# basic: Basic examples without any special cases +power(8::decimal, 2::decimal<38, 0>) = 64::fp64

I don't understand your comment Rich. The dialect should declare what option values are allowed for which functions. The test should skip test cases whose combination is not supported by a dialect. Why would we need to refer to test cases?

— Reply to this email directly, view it on GitHub https://github.com/substrait-io/substrait/pull/680#discussion_r1717912566, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDDCF6AITUJN3CECANWI3ZRQYG3AVCNFSM6AAAAABMPQFXWCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDEMZZGY3DENJQGI . You are receiving this because you commented.Message ID: @.***>

jacques-n commented 2 months ago

It's a different use case...

I believe the original request was some way to refer to test cases individually. My main thought is that the case itself should probably be the "pointer" (or a hash of the case) as opposed to some kind of hand-defined naming system. From a substrait core perspective, I think there should be no nuance to describe "I support funcA with optionA and optionB but sometimes I return different results than what is defined in the test cases for funcA". To me, that's not funcA--it's an entirely different function. That system should say "I support a funcAprime which might be similar to funcA but has a different set of test cases".

My inclination is we should avoid trying to create some system that says "these portion of test cases for funcA also apply to funcAprime" Feels like an over-engineer.

EpsilonPrime commented 2 months ago

The usage is more like "this system doesn't support this function" sufficiently. A compatibility matrix lists all the systems that support a functionality and which ones that don't. While we could say that only these tests pass, the set of passes and failures provide a more complete story of compatibility.

jacques-n commented 2 months ago

There are some functions (maybe very few) where avoiding this will be challenging. [snip]... example is the divide function...

These are just different functions from my pov. We shouldn't be fearful of clearing documenting different divide functions (through options and/or variants). In fact, I see this as one of the key values of Substrait for people--saying there are 3 distinct divide functions in each of these systems and how each differs. (FWIW, in this particular example I see both using variant and/or options as reasonable.)

My point in general here is that whether a particular engine implements a particular function is a property of the dialect and not the function. Functions should be well defined outside of a specific instance of use.

Now, many people may be fine to do lossy translation between these divide functions and we should make that easy with substrait tooling. But I don't think we should confound that with the functions themselves. I'm generally of the perspective that options and different functions should solve these problems without having to introduce additional primitives.

For example, we may need to introduce the following kind of psuedo-mappings in substrait:

mappings:
  from: postgres.yaml:div(float,float)
  to:  duckdb.yaml:div(float,float) 
  differences:
    - divide by NAN => NAN in duckdb, NULL in postgres
    - ...
    - ...

Then someone using the two systems could get a list of lossy translations and evaluate whether that list is compatible with their use case. (Maybe okay for general analytics, not for financial analytics).

In many cases, I would suspect that we could also define expression trees that provide lossless conversions in these cases. Odds are those scenarios might be more compute expensive.