The development for this project is now done at https://github.com/feldera/dbsp.
This repository holds the source code for a compiler translating SQL view definitions into DBSP circuits. DBSP is a framework for implementing incremental, streaming, (and non-streaming) queries. DBSP is implemented in Rust in the repository https://github.com/vmware/database-stream-processor
The SQL compiler is based on the Apache Calcite compiler infrastructure https://calcite.apache.org/
The code has been tested on Windows 10/11 and Linux.
You need Java 8 or later to build the compiler, and the maven (mvn) Java build program. Maven will take care of installing all required Java dependencies.
The code generated by the compiler is Rust. To run it you need a working installation of Rust (we recommend using rustup to install: https://www.rust-lang.org/tools/install). The compiler uses the current version of the DBSP library sources from github (https://github.com/vmware/database-stream-processor).
The testing programs use sqllogictest -- see the section on testing
If you want to generate images of the query plans you need to also install graphviz as described here https://graphviz.org/download/.
If you get Rust compilation errors you should try to make sure you have the latest version of the Rust libraries and toolchain:
$ rustup update
$ cd temp
$ cargo update
(temp
is the directory where the tests write the generated Rust code.)
Documentation is under development. Generating the documentation requires Python, and make. The documentation can be generated using the following command:
$ cd doc
$ make html
(The first time you attempt to generate the documentation it will
attempt to install additional dependencies, such as Sphinx; you may
need to delete the Python-created venv-docs
directory after this
installation to be able to properly generate html. The directory will
be regenerated.)
The result is produced in doc/_build/html/index.html
.
To run the tests:
$ cd SQL-compiler
$ ./run-tests.sh
Beware that the full sql logic tests can run for a few weeks, there are more than 7 million of them! Most of the time is spent compiling Rust, hopefully we'll be able to speed that up at some point.
The DBSP runtime is optimized for performing incremental view maintenance. In consequence, DBSP programs in SQL are expressed as VIEWS, or standing queries. A view is a function of one or more tables and other views.
For example, the following query defines a view:
CREATE VIEW V AS SELECT * FROM T WHERE T.age > 18
In order to interpret this query the compiler needs to have been given a definition of table (or view) T. The table T should be defined using a SQL Data Definition Language (DDL) statement, e.g.:
CREATE TABLE T
(
name VARCHAR,
age INT,
present BOOLEAN
)
The compiler must be given the table definition first, and then the view definition. The compiler generates a Rust library which implements the query as a function: given the input data, it produces the output data.
The compiler can (optionally) generate a library which will
incrementally maintain the view V
when presented with changes to
table T
:
table changes
V
tables -----> SQL-to-DBSP compiler ------> DBSP circuit
views V
view changes
A compiler from SQL to DBSP is produced by the build system usign mvn -DskipTests package
. A one-line Linux shell script, called
sql-to-dbsp
, residing in the directory SQL-compiler
directory,
invokes the compiler. Here is an example:
$ ./sql-to-dbsp
Usage: sql-to-dbsp [options] Input file to compile
Options:
-h, --help, -
Show this message and exit
-O0
Do not optimize
Default: false
-alltables
Generate an input for each CREATE TABLE, even if the table is not used
by any view
Default: false
-d
SQL syntax dialect used
Default: ORACLE
Possible Values: [BIG_QUERY, ORACLE, MYSQL, MYSQL_ANSI, SQL_SERVER, JAVA]
-f
Name of function to generate
Default: circuit
-i
Generate an incremental circuit
Default: false
-j
Emit JSON instead of Rust
Default: false
-je
Emit error messages as a JSON array to stderr
Default: false
-jpg
Emit a jpg image of the circuit instead of Rust
Default: false
-o
Output file; stdout if null
$ ./sql-to-dbsp x.sql -o ../temp/src/lib.rs
The last command-line compiles a script called x.sql
and writes the
result in a file lib.rs
. Let's assume we are compiling the
following input file:
$ cat x.sql
-- example input file
CREATE TABLE T(COL0 INTEGER, COL1 INTEGER);
CREATE VIEW V AS SELECT T.COL1 FROM T;
The input file can contain comments, and only two kinds of SQL
statements, separated by semicolons: CREATE TABLE
and CREATE VIEW
.
In the generated DBSP circuit every CREATE TABLE
is translated to an
input, and every CREATE VIEW
is translated to an output. The result
produced will look like this:
$ cat ../temp/src/lib.rs
// Automatically-generated file
[...boring stuff removed...]
pub fn circuit(workers: usize) -> (DBSPHandle, Catalog) {
let mut catalog = Catalog::new();
let (circuit, handles) = Runtime::init_circuit(workers, |circuit| {
let map34: _ = move |t: &Tuple2<Option<i32>, Option<i32>>, | -> Tuple1<Option<i32>> {
Tuple1::new(t.1)
};
// CREATE TABLE `T` (`COL0` INTEGER, `COL1` INTEGER)
let (T, handle0) = circuit.add_input_zset::<Tuple2<Option<i32>, Option<i32>>, Weight>();
let stream38: Stream<_, OrdZSet<Tuple1<Option<i32>>, Weight>> = T.map(map34);
// CREATE VIEW `V` AS
// SELECT `T`.`COL1`
// FROM `T`
let handle1 = stream38.output();
(handle0,handle1,)
}).unwrap();
catalog.register_input_zset_handle("T", handles.0);
catalog.register_output_batch_handle("V", handles.1);
(circuit, catalog)
}
You can compile the generated Rust code:
$ cd ../temp
$ cargo build
The generated file contains a Rust function called circuit
(you can
change its name using the compiler option -f
). Calling circuit
will return an executable DBSP circuit handle, and a DBSP catalog.
These APIs can be used to execute the circuit. See the DBSP
documentation for more information.
TODO: add here an example invoking the circuit.
Compilation proceeds in several stages:
SqlNode
data typesRelNode
representationCalciteProgram
data structure, which packages together all the
views that are being compiled (multiple views can be maintained simultaneously)CalciteToDBSPCompiler
converts a CalciteProgram
data structure into a DBSPCircuit
data structure.CircuitOptimizer
class can optimize the generated circuit, optionally converting
it into an incremental circuit, which is expected to compute only on changes.circuit
can be serialized as Rust using the ToRustString
visitor.Unit tests are written using JUnit and test pointwise parts of the compiler.
They can be executed usign mvn test
.
One of the means of testing the compiler is using sqllogictests: https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki.
We have implemented a general-purpose testing framework for running
SqlLogicTest programs, in the org.dbsp.sqllogictest
package. The
framework parses SqlLogicTest files and creates an internal
representation of these files. The files are executed by "test
executors".
The tests are run by a standalone executable (the executable is
invoked by the run-tests.sh
script). The executable supports the
following command-line arguments:
Usage: [options] Files or directories with test data (relative to the specified directory)
Options:
-b
Load a list of buggy commands to skip from this file
-d
Directory with SLT tests
Default: ../../sqllogictest
-e
Executor to use; one of 'none, JDBC, calcite'
Default: none
-h
Show this help message and exit
Default: false
-i
Install the SLT tests if the directory does not exist
Default: false
-inc
Incremental testing
Default: false
-n
Do not execute, just parse the test files
Default: false
-s
Ignore the status of SQL commands executed
Default: false
-x
Stop at the first encountered query error
Default: false
We have multiple executors:
NoExecutor
test executorThis executor does not really run any tests. But it can still be used by the test loading mechanism to check that we correctly parse all SQL logic test files.
DBSPExecutor
The model of SQLLogicTest has to be adapted for testing using DBSP.
Since DBSP is not a database, but a streaming system, some SQL
statements are ignored (e.g., CREATE INDEX
) and some other
cannot be supported (e.g., CREATE UNIQUE INDEX
).
So a SqlLogicTest script is turned into multiple DBSP tests, each of which creates a circuit, feeds it one input, reads the output, and validates it, executing exactly one transaction. When testing incremental circuits the test code feeds multiple inputs and only checks the final output.
JDBC
executorThis executor parallels the standard ODBC executor written in C by sending the statements and queries to a database to be executed. Any database that supports JDBC and can handle the correct syntax of the queries can be used, but we use by default the HSQLDB http://hsqldb.org/ database.
DBSP_JDBC_Executor
This executor is a combination of the DBSP executor and the JDBC executor, using a real database to store data in tables, but using DBSP as a query engine. It should be able to execute all SqlLogicTest queries that are supported by the underlying database.
This executor uses the JDBC executor to execute the statements storing the data, and the default Calcite compiler settings to compile and execute the queries. This code has been contributed to the Calcite project.
The 'inc' column shows tests for incremental circuits, the other columns show tests for non-incremental (standard) implementations.
The matrix represents the current test results; the numbers indicate the passing/failing tests in each category. The failing test cases are detailed below.
Test group | DBSP | JDBC_DBSP | JDBC_DBSP inc |
---|---|---|---|
random/select | 1,120,329/0 | 1,120,329/0 | 1,120,329/0 |
random/groupby | 118,757/0 | 118,757/0 | 118,757/0 |
random/expr | 1,317,682/0 | 1,198,926/0 | 1,198,926/0 |
random/aggregates | 1,172,825/2 | 1,172,825/0 | 1,172,825/0 |
select1 | 1,000/0 | 1,000/0 | 1,000/0 |
select2 | 1,000/0 | 1,000/0 | 1,000/0 |
select3 | 3,320/0 | 3,320/0 | 3,320/0 |
select4 | 2,832/0 | 2,832/0 | 2,832/0 |
select5 | 732/0 | 732/0 | 732/0 |
index/delete | N/A | 40,525/0 | 40,525/0 |
index/in | N/A | 130,065/0 | 130,065/0 |
index/commute | N/A | 507,514/0 | 507,514/0 |
index/between | N/A | 121,811/0 | 121,811/0 |
index/orderby_nosort | N/A | 490,986/0 | 490,986/0 |
index/view | N/A | 53,490/0 | 53,490/0 |
index/random | N/A | 188,449/0 | 188,449/0 |
index/orderby | N/A | 310,630/0 | 310,630/0 |
evidence | N/A | 153/25 |
The "index" tests cannot be executed with the DBSPExecutor
since it
does not support the "unique index" SQL statement.
We are manually converting Postgres tests https://github.com/postgres/postgres/blob/master/src/test/regress/expected to run on DBSP, since the Postgres SQL syntax is too different from the Calcite syntax (especially in the functions supported). In some cases the semantics of Postgres SQL is different from Calcite; for example, in Postgres the days of the week are numbered from 0, where 0 is Sunday, but in Calcite the days of the week are numbered from 1, where 1 is Sunday. In such cases we have to change the tests.