opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[FEATURE] Add PPL Sanity Tests Job #718

Open YANG-DB opened 1 day ago

YANG-DB commented 1 day ago

Is your feature request related to a problem?

We need a comprehensive testing framework to validate PPL commands in the Spark environment, ensuring that each new PPL (Spark) release continues to meet its critical requirements.

This testing job should be deployable on any Spark-PPL compatible setup and should automate the dataset setup, reducing the friction for developers and testers. Additionally, this can evolve into a multi-step project, eventually introducing TPC-H-based performance benchmarking as well as extended validation scenarios.

What solution would you like?

We propose creating a Spark PPL Sanity Job that includes the following components:

Dataset Generator:

Endpoint API:

Extendable Framework:

Plugin Strategy: Developers should be able to extend the framework by adding new types of tests. This calls for a modular architecture in which each test can be a standalone plugin or module.

Grammar Extensions: It should be possible to add new PPL commands or grammar rules for testing without changing the architecture or packaging of the project; testing content should be defined as an additional resource of the project.
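The plugin strategy above could be sketched as a small Scala trait plus a registry. All names here (`PplSanityTest`, `TestRegistry`, `HeadCommandTest`) are illustrative and not part of the existing codebase:

```scala
// Hypothetical plugin interface for PPL sanity tests (illustrative names).
// Each test is a standalone module registered with a central registry,
// so coverage for new PPL commands can be added without changing the framework.
trait PplSanityTest {
  def name: String             // e.g. "head-basic"
  def pplQuery: String         // the PPL statement under test
  def expectedRowCount: Int    // a simple correctness check on a known dataset
}

object TestRegistry {
  private val tests = scala.collection.mutable.ListBuffer.empty[PplSanityTest]
  def register(t: PplSanityTest): Unit = tests += t
  def all: Seq[PplSanityTest] = tests.toList
}

// Example plugin covering the `head` command.
object HeadCommandTest extends PplSanityTest {
  val name = "head-basic"
  val pplQuery = "source=test_table | head 5"
  val expectedRowCount = 5
}
```

Because each plugin only carries a query string and an expectation, new test content can live in resource files rather than in the framework's packaging, matching the grammar-extension goal above.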

  1. Multi-step Test Jobs:

    • The framework should allow for incremental testing, starting with basic functionality and eventually supporting complex performance benchmarks such as TPC-H (or similar) as a future expansion.
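The incremental-testing idea above could be modeled as ordered stages, where expensive benchmark stages only run after the cheaper sanity stages pass. This is a minimal sketch; `TestStage` and `MultiStepRunner` are hypothetical names:

```scala
// Hypothetical multi-step runner (illustrative): stages execute in order,
// stopping at the first failure so expensive benchmarks such as TPC-H
// never run on a build that fails basic sanity checks.
case class TestStage(name: String, run: () => Boolean)

object MultiStepRunner {
  // Returns the names of stages that completed successfully.
  def runStages(stages: Seq[TestStage]): Seq[String] =
    stages.iterator
      .takeWhile(s => s.run())
      .map(_.name)
      .toSeq
}
```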

Why is this needed?

Example Use Cases & Existing Test Frameworks:

Proposed Architecture:

  1. Dataset Creation Step:

    • We could include built-in dataset generators or simple dataset templates as part of the repository. These templates could serve both sanity tests (small, controlled datasets) and performance tests (larger, more complex datasets).
  2. Parameterized Spark Job:

    • The job could accept parameters (like catalog name, schema, dataset size, test types, etc.) and trigger the corresponding tests dynamically based on the configuration.
  3. Modular Test Components:

    • Sanity Tests: Ensure that each PPL command (e.g., eval, head, sort) returns the expected results on a known dataset.
    • Performance Tests: Execute larger-scale queries and track runtime, memory usage, and throughput.
    • Backward Compatibility: Automatically run tests on previous PPL versions to check for any regressions.
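The parameterized job in step 2 could accept its configuration as plain command-line arguments. A sketch of the parsing side, under the assumption of `--key value` style flags (the parameter names and defaults here are illustrative, not an existing interface):

```scala
// Hypothetical configuration for the parameterized sanity job (illustrative).
// Parameters mirror the ones proposed above: catalog, schema, dataset size,
// and which test types to run.
case class SanityJobConfig(
  catalog: String = "spark_catalog",
  schema: String = "default",
  datasetSize: String = "small",          // "small" for sanity, "large" for perf
  testTypes: Seq[String] = Seq("sanity")  // e.g. sanity, performance, compat
)

object SanityJobConfig {
  // Parse "--key value" pairs into a config, keeping defaults for anything
  // not supplied on the command line.
  def fromArgs(args: Array[String]): SanityJobConfig =
    args.sliding(2, 2).foldLeft(SanityJobConfig()) {
      case (cfg, Array("--catalog", v))      => cfg.copy(catalog = v)
      case (cfg, Array("--schema", v))       => cfg.copy(schema = v)
      case (cfg, Array("--dataset-size", v)) => cfg.copy(datasetSize = v)
      case (cfg, Array("--test-types", v))   => cfg.copy(testTypes = v.split(',').toSeq)
      case (cfg, _)                          => cfg
    }
}
```

The resulting config object would then select which modular test components (sanity, performance, backward compatibility) the Spark job triggers.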

Do you have any additional context?

YANG-DB commented 1 day ago

@LantaoJin @penghuo @dai-chen can you please review and comment? thanks

anirudha commented 1 day ago

can we add milestones here? thanks