Problem Overview
In modern database engines, query optimization is key to delivering fast, efficient results. Traditional approaches, such as rule-based or cost-based optimizers and distributed execution engines, work to identify the most efficient execution plan. However, these methods have limitations, particularly in complex, high-volume workloads. To address this, we propose an alternative solution: accelerating query execution through materialized views, leveraging a time-space tradeoff to improve performance.
Proposed Solution
Our approach centers on an architecture that introduces a feedback loop between query plan telemetry and workload-driven optimizations. The solution uses materialized views to precompute and store intermediate query results, which future queries can reuse to reduce execution time. This methodology can be applied in multiple database contexts, including OpenSearch and MPP/Data Lake engines like Spark, Presto, and Trino.
Key Components of the Architecture:
Workload:
Collect telemetry data, such as execution plans, cardinality, runtime, and memory usage during query execution. These metrics will feed into the view selection process.
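Per-query telemetry such as the above could be captured as a compact record emitted by a plan-execution listener. A minimal sketch, assuming illustrative field names not prescribed by the source:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryTelemetry:
    """One telemetry record per executed query (field names are illustrative)."""
    query_id: str
    plan_fingerprint: str  # hash of the normalized execution plan
    cardinality: int       # number of rows produced
    runtime_ms: float      # wall-clock execution time
    memory_bytes: int      # peak memory used during execution

# Example record, as a plan-execution listener might emit it.
sample = QueryTelemetry("q-001", "agg_by_srcaddr", 12_450, 830.0, 64 * 1024 * 1024)
```

A frozen dataclass keeps records immutable and hashable, which is convenient when deduplicating telemetry streams downstream.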
Telemetry Preprocessing:
Transform collected telemetry into a uniform representation to standardize data across engines and streamline the view selection process.
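One way to build that uniform representation is to group raw telemetry by a normalized plan fingerprint and accumulate the statistics the selection step needs (frequency, total runtime, total rows). A hedged sketch; the record keys are assumptions:

```python
from collections import defaultdict

def aggregate_workload(records):
    """Group raw telemetry records (dicts) by plan fingerprint and
    accumulate per-plan statistics used later by view selection."""
    stats = defaultdict(lambda: {"count": 0, "total_runtime_ms": 0.0, "total_rows": 0})
    for r in records:
        s = stats[r["plan_fingerprint"]]
        s["count"] += 1                          # how often this plan shape recurs
        s["total_runtime_ms"] += r["runtime_ms"] # aggregate cost of the plan shape
        s["total_rows"] += r["cardinality"]      # aggregate output size
    return dict(stats)
```

Grouping by a plan fingerprint rather than raw query text lets syntactically different queries with the same plan shape share one workload entry.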
Heuristic-Based View Selection:
Analyze the workload to automatically identify sub-queries that would benefit from materialization. We propose a simple heuristic algorithm that selects views based on estimated performance gains and materialization costs.
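One plausible instantiation of such a heuristic is to rank candidate views by estimated benefit per byte of storage and take them greedily under a storage budget. A sketch under those assumptions; the field names and scoring formula are illustrative, not the proposal's final algorithm:

```python
def select_views(candidates, storage_budget_bytes):
    """Greedy heuristic: sort candidate sub-queries by estimated benefit
    per byte of storage, then admit them while the budget allows.
    Each candidate is a dict with illustrative keys:
      'name', 'hits_per_day', 'saved_ms_per_hit', 'size_bytes'."""
    def benefit_per_byte(c):
        # Daily runtime saved, normalized by the storage the view would consume.
        return (c["hits_per_day"] * c["saved_ms_per_hit"]) / max(c["size_bytes"], 1)

    chosen, used = [], 0
    for c in sorted(candidates, key=benefit_per_byte, reverse=True):
        if used + c["size_bytes"] <= storage_budget_bytes:
            chosen.append(c["name"])
            used += c["size_bytes"]
    return chosen
```

This is the classic greedy approximation to a knapsack-style selection problem; it is not optimal, but it matches the stated goal of simplicity and fast, repeated re-evaluation as the workload shifts.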
Materialized View Creation and Maintenance:
Materialized views are created and maintained to ensure data consistency. These views store intermediate results (e.g., summary aggregates, sub-query results) that can be reused to speed up query execution.
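Maintenance could follow a simple refresh-on-staleness policy: serve the stored result while it is fresh, and recompute it from base data once it exceeds an age threshold. A minimal sketch; the class and parameter names are hypothetical:

```python
import time

class MaterializedView:
    """Sketch of a refresh-on-staleness policy for a materialized view.
    Names and the fixed-age policy are illustrative assumptions."""

    def __init__(self, name, compute_fn, max_age_s=300):
        self.name = name
        self.compute_fn = compute_fn  # recomputes the view from base data
        self.max_age_s = max_age_s    # staleness tolerance in seconds
        self.result = None
        self.refreshed_at = None

    def read(self):
        now = time.monotonic()
        if self.refreshed_at is None or now - self.refreshed_at > self.max_age_s:
            # Stale (or never computed): refresh from the base tables.
            self.result = self.compute_fn()
            self.refreshed_at = now
        return self.result
```

A bounded-staleness policy like this trades perfect consistency for cheap reads; engines with change feeds could instead refresh incrementally on base-data updates.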
Implementation Plan:
Focus on VPC Flow Logs: We plan to use VPC flow logs as test data, inserting them into an AWS OpenSearch cluster and generating workloads to test the architecture. We'll collect telemetry by listening to plan execution, gathering metrics on cardinality, runtime, and memory usage. This data will be used to build a workload representation.
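Generating synthetic test documents for this plan might look like the following. The field names follow the standard VPC flow log record format; the value ranges and address space are illustrative assumptions:

```python
import random

def make_flow_log(rng):
    """Generate one synthetic VPC flow log document.
    Field names match the standard VPC flow log format;
    value distributions are illustrative."""
    return {
        "srcaddr": f"10.0.{rng.randint(0, 255)}.{rng.randint(1, 254)}",
        "dstaddr": f"10.0.{rng.randint(0, 255)}.{rng.randint(1, 254)}",
        "srcport": rng.randint(1024, 65535),
        "dstport": rng.choice([22, 53, 80, 443]),
        "protocol": rng.choice([6, 17]),  # TCP or UDP
        "bytes": rng.randint(40, 1_500_000),
        "packets": rng.randint(1, 1_000),
        "action": rng.choice(["ACCEPT", "REJECT"]),
    }

# Seeded generators keep the synthetic workload reproducible across runs.
docs = [make_flow_log(random.Random(i)) for i in range(3)]
```

Documents shaped like these can then be bulk-indexed into the OpenSearch cluster, and repeated aggregation queries over them (e.g., top talkers by `srcaddr`) form the workload whose telemetry drives view selection.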
Heuristic-Based View Selection: We aim to implement a simple, heuristic-based algorithm for view selection. This algorithm will automatically select views that offer significant performance boosts with minimal materialization costs. By focusing on simplicity, we can ensure rapid implementation and real-time optimization during query execution.