mkitzan / constexpr-sql

Header only library that parses and plans SQL queries at compile time
MIT License
139 stars 6 forks source link
constexpr metaprogramming sql sql-parser template-metaprogramming

Constexpr SQL

A light weight single header alternative to DBMS

This library was developed during my honors project at the University of Victoria under the supervision of Bill Bird. The original development occurred in this Metaprogramming Optimization repository, but was moved into a new, dedicated, home repository. The project was inspired and influenced by Hana Dusíková's great Compile Time Regular Expressions library (CTRE).

Maintenance may slow in the near future due to my employer's open source software policy. I will be looking into getting approval to continue maintaining this project.

Library Features and Compiler Support

Supported features:

Unsupported features (future work):

As of April 2020, Constexpr SQL is only supported by GCC 9.0+. The compiler support is constrained because of the widespread use of the new C++20 feature "Class Types in Non-Type Template Parameters" (proposal P0732R2) which is only implemented by GCC 9.0+. Library users specify SQL queries and column labels with string literals which are converted into constexpr objects all of which relies on functionality from P0732R2.

Example

The following example shows usage of all class templates a user is expected to define to use the library.

#include <iostream>
#include <string>

#include "sql.hpp"

using books =
    sql::schema<
        "books", sql::index<"title">,
        sql::column<"title", std::string>,
        sql::column<"genre", std::string>,
        sql::column<"year", unsigned>,
        sql::column<"pages", unsigned>
    >;

using authored =
    sql::schema<
        "authored", sql::index<>,
        sql::column<"title", std::string>,
        sql::column<"name", std::string>
    >;

using query =
    sql::query<
        "SELECT title AS book, name AS author, year, pages "
        "FROM books NATURAL JOIN (SELECT * FROM authored WHERE name = \"Harlan Ellison\") "
        "WHERE year = 1967 OR year >= 1972 AND genre = \"science fiction\"",
        books, authored
    >;

int main()
{
    authored a{ sql::load<authored>("tests/data/authored.tsv", '\t') };
    books b{ sql::load<books>("tests/data/books.tsv", '\t') };

    for (query q{ b, a }; auto const& [book, author, year, pages] : q)
    {
        std::cout << book << '\t' << author << '\t' << year << '\t' << pages << '\n';
    }

    return 0;
}

sql::schema defines a relation used in a query. sql::index defines how an sql::schema sorts its data (unsorted if unspecified). sql::column types are used to define the rows in an sql::schema. sql::query wraps a query statement and the sql::schema types the query will operate on. sql::load can be used to load data from a file into an sql::schema.

The example is from example.cpp in the root of the repository, and can be compiled and executed with the following command:

g++ -std=c++2a -pedantic -Wall -Wextra -Werror -O3 -Isingle-header/ -o example example.cpp && ./example

It is strongly recommended to compile with optimizations enabled, otherwise expect template bloat. Use of multiple objects of the same sql::query type is considered undefined behavior (due to issues involving static members). Instantiating sql::query objects should be performed within a guarded scope, like in the example. However, there are no use restrictions to sql::schema types. sql::schema types may be used multiple times within a single query or in many queries at once. There are more examples and information in presentation.pdf at the root of the repository.

Correctness and Performance Testing

The library has a significant testing system which is composed of two script pipelines. All tests use the data from another project of mine called Terminus which is a library database shell. The correctness testing pipeline generates nearly 1.5 million test queries, then Constexpr SQL's output is compared against the output of SQLite3 performing the same queries. The performance testing pipeline executes six different SQL queries implemented using Constexpr SQL and hand coded SQL. The queries are executed over 65 thousand times (256 for CROSS JOIN due to computational complexity), and the execution timing is captured using the Linux time tool.

The runner.sh script in the tests directory will execute correctness testing, and the runner.sh script in tests/perf will execute performance testing.

Important Class Templates and Implementation Details

The following sections provide a high-level description about how the library is implemented. Hopefully, the sections will provide useful code and document references to others looking to write similar libraries.

Class Template: sql::schema

The sql::schema class template represents relational schemas and, when instantiated, SQL tables. The class template is parameterized on three template parameters: Name, Index, and Col template parameter pack. Name defines the SQL table name which is matched against table names in a query's FROM statement. The Index template argument is used to support GROUP BY statements by using SFINAE to select the underlying column data container (std::vector or std::multiset). The Index template argument, when fully specified, provides the comparator functor used by the std::multiset container. The Cols template parameter pack is expanded into the sql::row type for the schema. sql::schema objects support structured binding declarations which is facilitated partly through the sql::schema API and partly through std namespace injections from sql::row helping to satisfy the argument dependant lookup of the get<i> function.

Reference the example earlier for proper usage of sql::schema. Notice in the example the string literal as a template argument. String literals are lvalue reference types which are passed as const pointers. Normally, pointers can not be used as template arguments. With the new C++20 feature mentioned earlier, a cexpr::string constructor template can be deduced to turn the string literal into a constexpr object. The deduction is enabled through cexpr::string's class template argument deduction guide which provides a mapping of constructor arguments to template parameters.

Class Template: sql::query

The sql::query class template is the user interface to the SQL query parser. The class is templated on a cexpr::string object (the SQL query) and a template parameter pack of sql::schema types. At compile time, the SQL query string is parsed into the relational algebra expression tree representing the query's computation. The constructor to a fully specified sql::query class takes a variadic pack of sql::schema objects which it uses to seed the relational algebra expression tree with iterators to data. The sql::query object can then be used in a range loop with structured binding declarations like in the example.

The relational algebra expression tree uses static members to hold data, so only one object of a single fully specified sql::query class can exist at once in the program. To ensure this the object should be constructed within a guarded scope like in the example. It's worth noting that even though this class template's source file is the largest among the code base, nearly all of it is only live during compilation to parse the SQL query. In fact, the runtime interface is deliberately insubstantial, merely providing an wrapper to support range loops and structured binding declarations.

In compliance with range loop syntax, sql::query has an associated iterator class sql::query_iterator. sql::query_iterator wraps the type representing the relational algebra expression and handles all of the idiosyncrasies of its usage in favor of the familiar forward iterator interface. When an sql::query_iterator is dereferenced, it returns a constant reference to an sql::row object representing the current row of output from the query stream.

Class Template: sql::row

The sql::row class template is a template recursive linked-list (similar to std::tuple). A template recursive linked-list is a template metaprogramming pattern which expresses a type analogous to a traditional linked-list. sql::row implements this pattern with two template parameters Col and Next. Col represents the sql::column which the node in list holds a data element from. Next represents the type of the next node in the list, which is either another sql::row type or sql::void_row. Because the next node is expressed as a type the template linked-list does not incur the overhead of holding a next pointer nor the run time cost of dereferencing a pointer to iterate (also makes better use of the cache). A quirk to this pattern is that the node data type need not be homogenous across the list, instead the list may be composed of heterogenous data types. Also, template linked-list access is computed at compile time, so the run time cost is constant.

The example does not demonstrate the sql::get<cexpr::string> helper function. This helper function allows the user to query a specific element from an sql::row object by column name. Additionally, sql::row has ML-like functions for getting the head and tail of the object.

Relational Algebra Expression Nodes

At the moment, ra::projection, ra::rename, ra::cross, ra::natural, ra::selection, and ra::relation are the only relational algebra nodes implemented. ra::projection and ra::rename are unary operators which take a single sql::row from their Input relational algebra operator and fold their operation over the row before propagating the transformed row to their Output. The fold is implemented as a template recursive function. ra::cross outputs the cross product of two relations. ra::natural implements a natural join between two relations using a hash table buffer of the right relation for performance. ra::selection uses a predicate function constructed from a WHERE clause to filter rows in a query. ra::relation is the only terminal node in the expression tree which is used for retrieving the next input in the stream. These operators are composable types and are used to serialize the relational algebra expression tree. Individual objects of each type are not instantiated to compose the expression tree. Instead to ensure the expression tree is a zero overhead abstraction, the types implement a static member function next used to request data from its input type. The actual constexpr template recursive recursive descent SQL parser will serialize these individual nodes together into the appropriate expression tree.

Constexpr Parsing

As a proof of concept for constexpr parsing, two math expression parsers were implemented in old repository: cexpr::prefix and cexpr::infix. cexpr::prefix demonstrates the fundamental method of constexpr parsing an expression tree into a class template. cexpr::infix extends this to perform constepxr recursive descent parsing. cexpr::infix and the SQL query parser are a whole order of magnitude more complex, because there's recursive function template instantiations to many different function templates. The explanation of constexpr parsing is illustrated through cexpr::prefix for simplicity.

The expression tree created while parsing is a template recursive tree which shares similar properties to the template linked-list (discussed earlier). A notable benefit to this data structure is that because the tree is composed of types rather than data values, the tree can be used to express computation models (expression trees) rather than just a tree based container.

Fundamentally, the parsing is accomplished by calling a template recursive static constexpr parsing function member parameterized on the token position which the parser's "cursor" is standing on (Pos template parameter). At each "cursor" position, the function uses the token to decide the how to proceed. If the token indicates the start of an operation, the parser recurses the immediate left subexpression ("cursor" + 1). On return, the left subexpression will report the depth in the token stream it recursed at which point the right subexpression will pick up at this position. This control flow is expressed in this line of cexpr::prefix::parse. Once both left and right subexpressions are parsed, the new node's type is formed (decltype left and right subexpressions) which is then propagated to the caller. Otherwise, if the token indicates a terminal, then an appropriate terminal node is constructed. It is necessary that the "cursor" position is unique across template instantiations, otherwise template memoization will lead to "infinite" recursion.

The few ancillary class templates used to support this parsing and the math node struct templates can be found in the templ namespace of the old repository. There is also a driver program using the parsers in the old repository. In the Constexpr SQL parser, all of the entities in the templ namespace were replaced for more template metaprogramming idiomatic structures.