All sounds good! As soon as possible, please try to narrow down which of the above directions you will pick. You have several ideas in here, and you'll need to pick one or two to make the project accomplishable. Do you think you can do that within a few days?
I also think it would be good to be a little more specific about what numbers you will measure. You mentioned you:
will compare the execution times to the ones achieved by the proposed cross-database query optimizer
And comparing against a naive baseline seems good. However, you will also need to measure the specific impact of the optimizations you build on top of your framework. For example, if you build the optimized joins, you will want to compare a version of your system with the optimization against a version of your system without the optimized joins.
Sure! I am doing some research in order to narrow this down, so I am reading a little bit of related work.
And yes, regarding the evaluation, I imagine a comparison matrix like the one you suggest, containing the execution times of (1) Spark SQL, (2) my current prototype, and (3) my updated prototype with the improvements from this project.
Sounds good!
What will you do?
Query Compilers and Query Optimization
Traditional Database Query Optimization
Typically, a query compiler translates an input SQL query into an optimized query execution plan, executes it on the machine, and collects the results to send them back to the user. Usually, this consists of the following steps: parsing the SQL text into an abstract syntax tree, analyzing/resolving it against the catalog, optimizing the logical plan (e.g., predicate push-down, join reordering), generating a physical execution plan, and finally executing that plan.
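For instance, in Spark SQL (the engine this proposal later builds on), the intermediate plans of this pipeline can be inspected through a Dataset's `queryExecution` field. This is only a small illustrative sketch; the `lineitem` table name is an arbitrary example and assumes such a table has been registered:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("plan-inspection")
  .master("local[*]")
  .getOrCreate()

// "lineitem" is just an example table name assumed to be registered already.
val df = spark.sql(
  "SELECT l_orderkey, SUM(l_extendedprice) AS total FROM lineitem GROUP BY l_orderkey")

// Each stage of the compilation pipeline is exposed on queryExecution:
println(df.queryExecution.logical)       // parsed logical plan
println(df.queryExecution.analyzed)      // resolved logical plan
println(df.queryExecution.optimizedPlan) // after Catalyst's logical optimizations
println(df.queryExecution.executedPlan)  // physical plan that actually runs
```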
Cross-Database Query Optimization
Nowadays, it is common to have data stored in multiple data sources. Users need a way to integrate that data by running queries over those sources. We call queries that reference data (tables) spanning different systems Cross-Database Queries (CDQ). When it comes to optimizing such queries, new challenges are introduced.
State-of-the-Art Cross-Database Query Execution
There are already systems that support cross-database queries. For example, Spark SQL or Facebook's Presto provide a distributed query execution engine that is suitable for such tasks. Usually, a query in such a system is executed as follows: each external table referenced by the query is read from its source system (essentially a table scan shipped over the network), and all remaining operators, such as joins and aggregations, are executed by the engine itself on the transferred data.
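For concreteness, a minimal sketch of such a query in Spark SQL using the built-in JDBC data source; the connection URLs, credentials, and table names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cdq-baseline").master("local[*]").getOrCreate()

// Placeholder connection details for the two external systems.
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/tpch")
  .option("dbtable", "orders")
  .option("user", "user").option("password", "secret")
  .load()

val customer = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/tpch")
  .option("dbtable", "customer")
  .option("user", "user").option("password", "secret")
  .load()

orders.createOrReplaceTempView("orders")
customer.createOrReplaceTempView("customer")

// Both tables are pulled into Spark and the join runs inside Spark,
// regardless of what either database could have done locally.
spark.sql(
  """SELECT c_name, COUNT(*) AS num_orders
    |FROM customer JOIN orders ON c_custkey = o_custkey
    |GROUP BY c_name""".stripMargin).show()
```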
However, following that approach we lose a lot of potential optimization opportunities. The main one is that we treat the external databases as simple storage engines, completely ignoring their query execution capabilities (join strategies, indexes, etc.).
The main idea of this project is to create a query compiler that is able to exploit such optimization opportunities. Instead of treating the external database systems merely as storage engines that keep data, the proposed query compiler will identify parts of the query that can be grouped together and pushed down to the external system for local execution. This way, we take advantage of the external system's capabilities, and we may also reduce network transfers.
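A minimal sketch of what such a push-down amounts to with Spark's JDBC source: instead of loading the whole table, the SQL generated for the Postgres-resident part of the query can be passed through the `dbtable` option as a subquery, so Postgres executes it locally and only the (much smaller) result crosses the network. The query text here is purely illustrative:

```scala
// Hypothetical SQL produced by the proposed compiler for the Postgres-resident part of a query.
val pushedDown =
  """(SELECT o_custkey, COUNT(*) AS num_orders
    |  FROM orders
    |  WHERE o_orderdate >= DATE '1995-01-01'
    |  GROUP BY o_custkey) AS pre_aggregated_orders""".stripMargin

val preAggregated = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/tpch")
  .option("dbtable", pushedDown)            // Postgres runs the subquery locally
  .option("user", "user").option("password", "secret")
  .load()

// Only the pre-aggregated result is transferred and joined inside Spark.
preAggregated.createOrReplaceTempView("pre_aggregated_orders")
```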
A big part of the aforementioned functionality already exists. The two possible directions to be followed are:
How will you do it?
The implementation of the system will be built on top of Spark SQL. Spark SQL, through its DataFrame API, gives us a unified query optimizer that makes it straightforward to parse and execute queries over tables that span multiple systems. The development/experimental setup will consist of a single-node Spark SQL instance, one Postgres instance, and one MySQL instance. Currently, the main implementation consists of a Scala component that extracts the query graph from a given SQL query. The optimizer then generates a cross-database query plan in a tree-like structure from that graph. The executor then generates SQL code from that intermediate representation, which is pushed down to each system for local execution.
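Since the prototype's actual classes are not shown here, the following is only a hypothetical sketch of what the tree-like intermediate representation and the SQL-generation step could look like; all names are made up for illustration:

```scala
// Hypothetical cross-database plan nodes; `source` records which system executes a subtree.
sealed trait CdqPlan { def source: String }
case class Scan(table: String, source: String) extends CdqPlan
case class Filter(pred: String, child: CdqPlan) extends CdqPlan { val source: String = child.source }
case class Join(cond: String, left: CdqPlan, right: CdqPlan, source: String) extends CdqPlan

// The "executor" step: turn a subtree that lives entirely in one system into SQL text
// that can be pushed down to that system for local execution.
def toSql(plan: CdqPlan): String = plan match {
  case Scan(table, _)      => s"SELECT * FROM $table"
  case Filter(pred, child) => s"SELECT * FROM (${toSql(child)}) t WHERE $pred"
  case Join(cond, l, r, _) => s"SELECT * FROM (${toSql(l)}) a JOIN (${toSql(r)}) b ON $cond"
}

// Example: both tables live in the same Postgres instance, so the whole join is pushed down.
val plan = Join("a.c_custkey = b.o_custkey",
  Scan("customer", "postgres"), Scan("orders", "postgres"), "postgres")
println(toSql(plan))
```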
How will you empirically measure success?
To measure success, standard database benchmarks will be used, including TPC-H, TPC-DS, and the Join Order Benchmark (JOB). We will run the queries on the plain Spark SQL implementation and compare the execution times to the ones achieved by the proposed cross-database query optimizer.
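The comparison itself can be as simple as timing the same benchmark query under each configuration. A minimal sketch, reusing the `spark` session and the `customer`/`orders` views registered in the earlier snippet, with an arbitrary example query standing in for the benchmark queries:

```scala
// Minimal timing helper; returns wall-clock seconds for one execution of `body`.
def timeSec[A](body: => A): Double = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1e9
}

val query =
  """SELECT c_name, COUNT(*) AS num_orders
    |FROM customer JOIN orders ON c_custkey = o_custkey
    |GROUP BY c_name""".stripMargin

// One row of the comparison matrix for this query: plain Spark SQL over the JDBC sources.
// The current prototype and the improved prototype would be timed the same way once wired in.
val baselineSec = timeSec { spark.sql(query).collect() }
println(f"baseline Spark SQL: $baselineSec%.2f s")
```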
Team members:
me
*myself=&me
*i=myself