substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0

Proposal to define a test file format #681

Open scgkiran opened 2 months ago

scgkiran commented 2 months ago

- Define a simple, human-readable test file format. It should be easy to parse programmatically.
- Define an ANTLR grammar for the test file format. This will enable parser generation for multiple languages.
- Define the format of literals for the various data types. It should include both simple and complex data types.
- The format should cover tests for all types of functions: scalar, aggregate, and window.
- This will make it possible to build a tool that reports coverage for each of the Substrait functions.
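To make the goals above concrete, here is a minimal sketch of parsing one scalar test case. The line syntax (`name(value::type, ...) = value::type`) and the names `ParsedCase`, `parse_literal`, and `parse_test_line` are assumptions made for illustration, not the format proposed in the draft PR. The sketch also uses a regular expression rather than the ANTLR grammar the proposal calls for, so it only handles simple literals without embedded commas.

```python
import re
from typing import List, NamedTuple, Tuple

# Hypothetical line syntax, assumed for illustration only:
#   add(1::i8, 2::i8) = 3::i8
LINE_RE = re.compile(r"^(?P<func>\w+)\((?P<args>.*)\)\s*=\s*(?P<result>.+)$")


class ParsedCase(NamedTuple):
    function: str
    args: List[Tuple[str, str]]  # (value text, type name) pairs
    expected: Tuple[str, str]    # (value text, type name)


def parse_literal(text: str) -> Tuple[str, str]:
    """Split 'value::type' into (value, type); the '::' separator is assumed."""
    value, _, type_name = text.strip().rpartition("::")
    if not value or not type_name:
        raise ValueError(f"not a typed literal: {text!r}")
    return value, type_name


def parse_test_line(line: str) -> ParsedCase:
    """Parse one test case line in the assumed syntax."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a test case line: {line!r}")
    args_text = match.group("args").strip()
    # Naive comma split: breaks on string or complex literals containing commas.
    args = [parse_literal(a) for a in args_text.split(",")] if args_text else []
    return ParsedCase(match.group("func"), args, parse_literal(match.group("result")))


case = parse_test_line("add(1::i8, 2::i8) = 3::i8")
print(case.function, case.args, case.expected)
```

A generated parser would replace the regular expression and the comma split, which is where complex types such as lists and structs would need real grammar rules.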

scgkiran commented 2 months ago

Created a draft PR #680

EpsilonPrime commented 2 months ago

I have a few questions about the proposed test file format mainly stemming from not knowing what the format's intended use would be.

jacques-n commented 2 months ago

Will answer some high-level questions on these. We can discuss more in the sync tomorrow.

What use cases would this test format handle that aren't already covered by the test format in substrait-io/bft (which handles tests of functions)?

The intention is for this to supplant the BFT test files. These files define the semantics of the functions and are really an extended part of the documentation of function semantics. We have updates we're working on so that the BFT framework would source these files for testing. The current format in BFT is extremely verbose, which makes it burdensome to build test cases and difficult to scan them and accurately assess the quality of the coverage.

Is there provision for testing how relations work as defined in substrait-io/consumer-testing?

We're focused on these one at a time. We figured we'd start with scalar functions then move through other things one at a time. I'm not sure we'd be likely to get to relations.

Do the use cases require cross-language compatibility? If the cross-language capability is required would a protobuffer definition suffice or is an ANTLR grammar truly necessary?

As mentioned above, we brainstormed several iterations. The main thing we struggled with is that the human observability/intuitiveness aspect gets lost the moment you focus on making this easy for a machine to consume.

Does the test file format need to live in the specification repository (here)?

Yes, I think it should. It is a specification as much as a test: it states how a function behaves under certain conditions. I think of it as simply an extension of the specification in the YAML files to help clear up ambiguities.

We also intend to introduce a coverage tool that gives us an overview of the number of test cases per function. Most of our functions are underspecified at the moment. Once we have coverage reporting, I would recommend that new functions be required to include clearly specified behavior for common cases, edge cases, and different option combinations. Otherwise, we aren't really specifying anything concrete.
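As a rough illustration of the per-function overview described above, the following sketch counts test cases per function name. It assumes each test case sits on one line that starts with `name(`, and that lines starting with `#` are comments. Both conventions, and the function name `count_cases_per_function`, are hypothetical rather than part of the proposal.

```python
from collections import Counter
from typing import Dict, Iterable


def count_cases_per_function(
    lines: Iterable[str], known_functions: Iterable[str]
) -> Dict[str, int]:
    """Count test case lines per function, keeping zero counts for untested functions."""
    # Seed with zeros so functions that have no test cases still show up.
    counts = Counter({name: 0 for name in known_functions})
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment marker
            continue
        # Everything before the first "(" is taken as the function name.
        name = line.split("(", 1)[0].strip()
        counts[name] += 1
    return dict(counts)


test_lines = [
    "# hypothetical test file contents",
    "add(1::i8, 2::i8) = 3::i8",
    "add(120::i8, 5::i8) = 125::i8",
    "",
    "subtract(5::i8, 2::i8) = 3::i8",
]
report = count_cases_per_function(test_lines, ["add", "subtract", "multiply"])
for name, n in sorted(report.items()):
    print(f"{name}: {n} test case(s)")
```

In a real tool, the list of known functions would presumably come from the function definitions in the YAML extension files, so that functions with no test cases stand out.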