All sounds good! As soon as possible, please try to narrow down which of the above directions you will pick. You have several ideas in here, and you'll need to pick one or two to make the project accomplishable. Do you think you can do that within a few days?
I also think it would be good to be a little more specific about what numbers you will measure. You mentioned you:
will compare the execution times to the ones achieved by the proposed cross-database query optimizer
And comparing against a naive baseline seems good. However, you will also need to measure the specific impact of the optimizations you build on top of your framework. For example, if you build the optimized joins, you will want to compare a version of your system with the optimization against a version of your system without the optimized joins.
Sure! I am doing some research in order to narrow this down, so I am reading a little bit of related work.
And yes, regarding the evaluation, I imagine a comparison matrix like the one you suggest, containing the execution times of (1) Spark SQL, (2) my current prototype, and (3) my updated prototype with the improvements from this project.
Sounds good!
What will you do?
Query Compilers and Query Optimization
Traditional Database Query Optimization
Typically, a query compiler translates an input SQL query into an optimized query execution plan, executes it on the machine, and collects the results to send them back to the user. Usually, this consists of the following steps: parsing the SQL text into an abstract syntax tree, analyzing/resolving it against the catalog, optimizing the logical plan (e.g., predicate push-down, join reordering), generating a physical execution plan, and finally executing that plan.
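For instance, in Spark SQL (the engine this proposal later builds on), the intermediate plans of this pipeline can be inspected through a Dataset's `queryExecution` field. This is only a small illustrative sketch; the `lineitem` table name is an arbitrary example and assumes such a table has been registered:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("plan-inspection")
  .master("local[*]")
  .getOrCreate()

// "lineitem" is just an example table name assumed to be registered already.
val df = spark.sql(
  "SELECT l_orderkey, SUM(l_extendedprice) AS total FROM lineitem GROUP BY l_orderkey")

// Each stage of the compilation pipeline is exposed on queryExecution:
println(df.queryExecution.logical)       // parsed logical plan
println(df.queryExecution.analyzed)      // resolved logical plan
println(df.queryExecution.optimizedPlan) // after Catalyst's logical optimizations
println(df.queryExecution.executedPlan)  // physical plan that actually runs
```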
Cross-Database Query Optimization
Nowadays, it is common to have data stored in multiple data sources. Users need a way to integrate that data by running queries over those sources. We call queries that reference data (tables) spanning different systems Cross-Database Queries (CDQ). When it comes to optimizing such queries, new challenges are introduced.
State-of-the-Art Cross-Database Query Execution
There are already systems that support cross-database queries. For example, Spark SQL or Facebook's Presto provide a distributed query execution engine that is suitable for such tasks. Usually, a query in such a system is executed as follows: each external table referenced by the query is read from its source system (essentially a table scan shipped over the network), and all remaining operators, such as joins and aggregations, are executed by the engine itself on the transferred data.
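For concreteness, a minimal sketch of such a query in Spark SQL using the built-in JDBC data source; the connection URLs, credentials, and table names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cdq-baseline").master("local[*]").getOrCreate()

// Placeholder connection details for the two external systems.
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/tpch")
  .option("dbtable", "orders")
  .option("user", "user").option("password", "secret")
  .load()

val customer = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/tpch")
  .option("dbtable", "customer")
  .option("user", "user").option("password", "secret")
  .load()

orders.createOrReplaceTempView("orders")
customer.createOrReplaceTempView("customer")

// Both tables are pulled into Spark and the join runs inside Spark,
// regardless of what either database could have done locally.
spark.sql(
  """SELECT c_name, COUNT(*) AS num_orders
    |FROM customer JOIN orders ON c_custkey = o_custkey
    |GROUP BY c_name""".stripMargin).show()
```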
However, following that approach we lose a lot of potential optimization opportunities. The main one is that we treat the external databases as simple storage engines, completely ignoring their query execution capabilities (join strategies, indexes, etc.).
The main idea of this project is to create a query compiler that is able to exploit such optimization opportunities. Instead of treating the external database systems merely as storage engines that keep data, the proposed query compiler will identify parts of the query that can be grouped together and pushed down to the external system for local execution. This way, we take advantage of the external system's capabilities, and we may also reduce network transfers.
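A minimal sketch of what such a push-down amounts to with Spark's JDBC source: instead of loading the whole table, the SQL generated for the Postgres-resident part of the query can be passed through the `dbtable` option as a subquery, so Postgres executes it locally and only the (much smaller) result crosses the network. The query text here is purely illustrative:

```scala
// Hypothetical SQL produced by the proposed compiler for the Postgres-resident part of a query.
val pushedDown =
  """(SELECT o_custkey, COUNT(*) AS num_orders
    |  FROM orders
    |  WHERE o_orderdate >= DATE '1995-01-01'
    |  GROUP BY o_custkey) AS pre_aggregated_orders""".stripMargin

val preAggregated = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/tpch")
  .option("dbtable", pushedDown)            // Postgres runs the subquery locally
  .option("user", "user").option("password", "secret")
  .load()

// Only the pre-aggregated result is transferred and joined inside Spark.
preAggregated.createOrReplaceTempView("pre_aggregated_orders")
```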
A big part of the aforementioned functionality already exists. The two possible directions to be followed are:
How will you do it?
The implementation of the system will be built on top of Spark SQL. Spark SQL, through its DataFrame API, gives us a unified query optimizer that makes it straightforward to parse and execute queries over tables that span multiple systems. The development/experimental setup will consist of a single-node Spark SQL instance, one Postgres instance, and one MySQL instance. Currently, the main implementation consists of a Scala component that extracts the query graph from a given SQL query. The optimizer then generates a cross-database query plan in a tree-like structure from that graph. The executor then generates SQL code from that intermediate representation, which is pushed down to each system for local execution.
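Since the prototype's actual classes are not shown here, the following is only a hypothetical sketch of what the tree-like intermediate representation and the SQL-generation step could look like; all names are made up for illustration:

```scala
// Hypothetical cross-database plan nodes; `source` records which system executes a subtree.
sealed trait CdqPlan { def source: String }
case class Scan(table: String, source: String) extends CdqPlan
case class Filter(pred: String, child: CdqPlan) extends CdqPlan { val source: String = child.source }
case class Join(cond: String, left: CdqPlan, right: CdqPlan, source: String) extends CdqPlan

// The "executor" step: turn a subtree that lives entirely in one system into SQL text
// that can be pushed down to that system for local execution.
def toSql(plan: CdqPlan): String = plan match {
  case Scan(table, _)      => s"SELECT * FROM $table"
  case Filter(pred, child) => s"SELECT * FROM (${toSql(child)}) t WHERE $pred"
  case Join(cond, l, r, _) => s"SELECT * FROM (${toSql(l)}) a JOIN (${toSql(r)}) b ON $cond"
}

// Example: both tables live in the same Postgres instance, so the whole join is pushed down.
val plan = Join("a.c_custkey = b.o_custkey",
  Scan("customer", "postgres"), Scan("orders", "postgres"), "postgres")
println(toSql(plan))
```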
How will you empirically measure success?
To measure success, standard database benchmarks will be used, including TPC-H, TPC-DS, and the Join Order Benchmark (JOB). We will run the queries on the plain Spark SQL implementation and compare the execution times to the ones achieved by the proposed cross-database query optimizer.
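The comparison itself can be as simple as timing the same benchmark query under each configuration. A minimal sketch, reusing the `spark` session and the `customer`/`orders` views registered in the earlier snippet, with an arbitrary example query standing in for the benchmark queries:

```scala
// Minimal timing helper; returns wall-clock seconds for one execution of `body`.
def timeSec[A](body: => A): Double = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1e9
}

val query =
  """SELECT c_name, COUNT(*) AS num_orders
    |FROM customer JOIN orders ON c_custkey = o_custkey
    |GROUP BY c_name""".stripMargin

// One row of the comparison matrix for this query: plain Spark SQL over the JDBC sources.
// The current prototype and the improved prototype would be timed the same way once wired in.
val baselineSec = timeSec { spark.sql(query).collect() }
println(f"baseline Spark SQL: $baselineSec%.2f s")
```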
Team members:
me
*myself=&me
*i=myself