opensearch-project / opensearch-spark

Spark Accelerator framework ; It enables secondary indices to remote data stores.
Apache License 2.0
14 stars 23 forks source link

[RFC] Automatic Workload-Driven Query Acceleration by OpenSearch #128

Open dai-chen opened 2 years ago

dai-chen commented 2 years ago

Is your feature request related to a problem?

In a database engine, there are different ways to optimize query performance. For instance, rule-based/cost-based optimizer and distributed execution layer tries to find best execution plan by cost estimate and equivalent transformation of query plan. Here we're proposing an alternative approach which is to accelerate query execution by materialized view for time-space tradeoff.

What solution would you like?

Architecture

Here is a reference architecture that illustrates components and the entire workflow which essentially is a workload-driven feedback loop:

  1. Input: Query plan telemetry collected
  2. Generating feedback: Feeding it into a workload-driven feedback generator
  3. Output: Feedback for optimizer to rewrite query plan in future

Basically, feedback is referring to various materialized view prebuilt (either online or offline) which hints acceleration opportunity to query optimizer.

AutoMV (1) (1)

There are 2 areas and paths moving forward for both of which lack open source solutions:

General Acceleration Workflow

1.Workload Telemetry Collecting

Collect query plan telemetry generated in query execution and emit it as feedback generation input.

2.Workload Telemetry Preprocessing

Preprocess query plan telemetry into uniform workload representation.

3.View Selection

Analyze workload data and select sub-query as materialization candidate according to view selection algorithm.

4.View Materialization and Maintenance

Materialize selected view and maintain the consistency between source data and materialized view data, by incrementally refreshing for example.

5.Query Plan Optimization

At last, query optimizer checks the existing materialized view and replace original operator with scan on materialized view.

chloe-zh commented 2 years ago

Hi @dai-chen the design is excellent, always appreciate it that people like me in the community has the opportunity to read such great design like this! But I have a couple of questions after reading it.

  1. From the architecture looks like the feedback analysis machine is independent of the query engine, so would any multi node interactions between them be in the scope of this solution?
  2. How would you manage the memory/cache for the materialize view? If I'm understanding right the materializations are stored in memory, and I guess it's fine for results of small size, but when it comes to large size of tables or intermediate results like large table JOINs, the materialization alone could be super expensive for memory use, is there any strategy so far to deal with the memory?
anirudha commented 2 years ago

thanks @chloe-zh for the comment. there is a long plan in the works, we will share the RFCs here shortly.. in short for now/ 1.) the SQL core will be a more formal federated query acceleration engine with support for autonomous materialized views. A lot to unpack in the prior statement, but thats where we are headed. The SQL core will run as an independent distributed engine and also part of opensearch; as the use-cases need it. We may probably need to rename the SQL core to a better name :)

2) we are researching and thinking about spark / hyperspace or we build our own tier. The priority will be open source solutions first.

-> OpenSearch will be used for join and intermediate materialized views.. -> no in memory stuff. OpenSearch data structures are awesome.

hope this answers some questions , we will post more details shortly here with some demo videos :)