substrait-io / duckdb-substrait-extension

MIT License
28 stars 22 forks source link

Substrait - DuckDB

Substrait - DuckDB is an extension that provides substrait support to DuckDB. The main goal of this extension is to support a both production and consumption of substrait query plans in DuckDB.

To build, type

make

To run, run the bundled duckdb shell:

 ./build/release/duckdb 

Then, load the Substrait - DuckDB extension like so:

LOAD 'build/release/substrait.duckdb_extension';

Support

This extension is mainly supported in 3 different APIs. 1) The SQL API, 2) The Python API, 3) The R API. Here we depict how to consume and produce substrait query plans in each API.

SQL

In the SQL API, users can generate substrait plans (into a blob or a JSON) and consume substrait plans.

Before using the extension, you must always properly install and load it. To install and load the released version of the substrait library, you must execute the following SQL commands.

INSTALL substrait;
LOAD substrait;

1) Blob Generation

 To generate a substrait blob the ```get_substrait(SQL)``` function must be called with a valid SQL select query.
 ```sql
 CREATE TABLE crossfit (exercise text,difficulty_level int);
 INSERT INTO crossfit VALUES ('Push Ups', 3), ('Pull Ups', 5) , (' Push Jerk', 7), ('Bar Muscle Up', 10);

 CALL get_substrait('select count(exercise) as exercise from crossfit where difficulty_level <=5');
 ----
 \x12\x09\x1A\x07\x10\x01\x1A\x03lte\x12\x11\x1A\x0F\x10\x02\x1A\x0Bis_not_null\x12\x09\x1A\x07\x10\x03\x1A\x03and\x12\x10\x1A\x0E\x10\x04\x1A\x0Acount_star\x1A\xCB\x01\x12\xC8\x01\x0A\xBB\x01:\xB8\x01\x12\xAB\x01"\xA8\x01\x12\x97\x01\x0A\x94\x01\x12.\x0A\x08exercise\x0A\x0Fdifficulty_level\x12\x11\x0A\x07\xB2\x01\x04\x08\x0D\x18\x01\x0A\x04*\x02\x10\x01\x18\x02\x1AJ\x1AH\x08\x03\x1A\x04\x0A\x02\x10\x01""\x1A \x1A\x1E\x08\x01\x1A\x04*\x02\x10\x01"\x0C\x1A\x0A\x12\x08\x0A\x04\x12\x02\x08\x01"\x00"\x06\x1A\x04\x0A\x02(\x05"\x1A\x1A\x18\x1A\x16\x08\x02\x1A\x04*\x02\x10\x01"\x0C\x1A\x0A\x12\x08\x0A\x04\x12\x02\x08\x01"\x00"\x0A\x0A\x06\x0A\x02\x08\x01\x0A\x00\x10\x01:\x0A\x0A\x08crossfit\x1A\x00"\x0A\x0A\x08\x08\x04*\x04:\x02\x10\x01\x1A\x08\x12\x06\x0A\x02\x12\x00"\x00\x12\x08exercise
```

2) Json Generation

 To generate a json representing  the substrait plan the ```get_substrait_json(SQL)``` function must be called with a valid SQL select query.
 ```sql
 CALL get_substrait_json('select count(exercise) as exercise from crossfit where difficulty_level <=5');
 ----
 {"extensions":[{"extensionFunction":{"functionAnchor":1,"name":"lte"}},{"extensionFunction":{"functionAnchor":2,"name":"is_not_null"}},{"extensionFunction":{"functionAnchor":3,"name":"and"}},{"extensionFunction":{"functionAnchor":4,"name":"count_star"}}],"relations":[{"root":{"input":{"project":{"input":{"aggregate":{"input":{"read":{"baseSchema":{"names":["exercise","difficulty_level"],"struct":{"types":[{"varchar":{"length":13,"nullability":"NULLABILITY_NULLABLE"}},{"i32":{"nullability":"NULLABILITY_NULLABLE"}}],"nullability":"NULLABILITY_REQUIRED"}},"filter":{"scalarFunction":{"functionReference":3,"outputType":{"bool":{"nullability":"NULLABILITY_NULLABLE"}},"arguments":[{"value":{"scalarFunction":{"functionReference":1,"outputType":{"i32":{"nullability":"NULLABILITY_NULLABLE"}},"arguments":[{"value":{"selection":{"directReference":{"structField":{"field":1}},"rootReference":{}}}},{"value":{"literal":{"i32":5}}}]}}},{"value":{"scalarFunction":{"functionReference":2,"outputType":{"i32":{"nullability":"NULLABILITY_NULLABLE"}},"arguments":[{"value":{"selection":{"directReference":{"structField":{"field":1}},"rootReference":{}}}}]}}}]}},"projection":{"select":{"structItems":[{"field":1},{}]},"maintainSingularStruct":true},"namedTable":{"names":["crossfit"]}}},"groupings":[{}],"measures":[{"measure":{"functionReference":4,"outputType":{"i64":{"nullability":"NULLABILITY_NULLABLE"}}}}]}},"expressions":[{"selection":{"directReference":{"structField":{}},"rootReference":{}}}]}},"names":["exercise"]}}]}
 ```

3) Blob Consumption

 To consume a substrait blob the ```from_substrait(blob)``` function must be called with a valid substrait BLOB plan.
 ```sql
 CALL from_substrait('\x12\x07\x1A\x05\x1A\x03lte\x12\x11\x1A\x0F\x10\x01\x1A\x0Bis_not_null\x12\x09\x1A\x07\x10\x02\x1A\x03and\x12\x10\x1A\x0E\x10\x03\x1A\x0Acount_star\x1A\xA4\x01\x12\xA1\x01\x0A\x94\x01:\x91\x01\x12\x86\x01"\x83\x01\x12y:w\x12c\x12a\x12+\x0A)\x12\x1B\x0A\x08exercise\x0A\x0Fdifficulty_level:\x0A\x0A\x08crossfit\x1A2\x1A0\x08\x02"\x18\x1A\x16\x1A\x14"\x0A\x1A\x08\x12\x06\x0A\x04\x12\x02\x08\x01"\x06\x1A\x04\x0A\x02(\x05"\x12\x1A\x10\x1A\x0E\x08\x01"\x0A\x1A\x08\x12\x06\x0A\x04\x12\x02\x08\x01\x1A\x08\x12\x06\x0A\x04\x12\x02\x08\x01\x1A\x06\x12\x04\x0A\x02\x12\x00\x1A\x00"\x04\x0A\x02\x08\x03\x1A\x06\x12\x04\x0A\x02\x12\x00\x12\x08exercise'::BLOB);
 ----
 2

#### Controlling Query Optimization

The `get_substrait(SQL)` and `get_substrait_json(SQL)` functions accept an optional parameter, `enable_optimizer`,
to explicitly enable or disable query optimization when generating Substrait:

```sql
CALL get_substrait('select count(exercise) as exercise from crossfit', enable_optimizer=false);
CALL get_substrait_json('select count(exercise) as exercise from crossfit', enable_optimizer=true);

If enable_optimizer is not specified, it is inferred from the connection-level settings: if query optimization is disabled at the connection level (e.g. using PRAGMA disable_optimizer), the Substrait generation functions will not optimize the query; otherwise, they will.

If any specific optimizers are disabled at the connection level (e.g. using SET disabled_optimizers TO '...'), they will also be disabled when generating Substrait.

The from_substrait(blob) function always respects the connection-level settings when deciding whether to optimize a Substrait plan before executing it.

Python

Before using the extension you must remember to properly load it. To load an extension in python, you must execute the sql commands within a connection.

import duckdb

con = duckdb.connect()
con.install_extension("substrait")
con.load_extension("substrait")

[!TIP] See Controlling Query Optimization for more information on how to enable or disable the optimizer when generating Substrait. The Substrait generation functions below support an enable_optimizer=bool keyword argument for convenience.

1) Blob Generation

 To generate a substrait blob the ```get_substrait(SQL)``` function must be called, from a connection, with a valid SQL select query.
 ```python
 con.execute(query='CREATE TABLE crossfit (exercise text,difficulty_level int);')
 con.execute(query="INSERT INTO crossfit VALUES ('Push Ups', 3), ('Pull Ups', 5) , (' Push Jerk', 7), ('Bar Muscle Up', 10);")

 proto_bytes = con.get_substrait(query="select count(exercise) as exercise from crossfit where difficulty_level <=5").fetchone()[0]
2) Json Generation

     To generate a json representing  the substrait plan the ```get_substrait_json(SQL)``` function, from a connection, must be called with a valid SQL select query.
     ```python
     json =  con.get_substrait_json("select count(exercise) as exercise from crossfit where difficulty_level <=5").fetchone()[0]

3) Blob Consumption

 To consume a substrait blob the ```from_substrait(blob)``` function must be called, from the connection, with a valid substrait BLOB plan.
 ```python
 query_result = con.from_substrait(proto=proto_bytes)
```

R

Before using the extension you must remember to properly load it. To load an extension in R, you must execute the sql commands within a connection.

con <- dbConnect(duckdb::duckdb(config=list("allow_unsigned_extensions"="true")))
dbExecute(con, "LOAD('substrait')")
dbExecute(con, "INSTALL('substrait')"))

[!TIP] See Controlling Query Optimization for more information on how to enable or disable the optimizer when generating Substrait. The Substrait generation functions below support an enable_optimizer=bool keyword argument for convenience.

1) Blob Generation

 To generate a substrait blob the ```duckdb_get_substrait(con,SQL)``` function must be called, with a connection and a valid SQL select query.
 ```r
 dbExecute(con, "CREATE TABLE crossfit (exercise text,difficulty_level int);")
 dbExecute(con, "INSERT INTO crossfit VALUES ('Push Ups', 3), ('Pull Ups', 5) , (' Push Jerk', 7), ('Bar Muscle Up', 10);")

 proto_bytes <- duckdb::duckdb_get_substrait(con, "select * from integers limit 5")    
2) Json Generation

     To generate a json representing  the substrait plan  ```duckdb_get_substrait_json(con,SQL)``` function, with a connection and a valid SQL select query.
     ```r
     json <- duckdb::duckdb_get_substrait_json(con, "select count(exercise) as exercise from crossfit where difficulty_level <=5")

3) Blob Consumption

 To consume a substrait blob the ```duckdb_prepare_substrait(con,blob)``` function must be called, with a connection and a valid substrait BLOB plan.
 ```r
  result <- duckdb::duckdb_prepare_substrait(con, proto_bytes)
  df <- dbFetch(result)
```

Setting up CLion

Configuring CLion with the extension template requires a little work. Firstly, make sure that the DuckDB submodule is available. Then make sure to open ./duckdb/CMakeLists.txt (so not the top level CMakeLists.txt file from this repo) as a project in CLion. Now to fix your project path go to tools->CMake->Change Project Root(docs) to set the project root to the root dir of this repo.

Now to configure the build targets, copy the CMake variables specified in the Makefile and ensure the build directory is set to ../build/<build_mode>.

Updating the Substrait version

Use the script in scripts/update_substrait.py to update the substrait version. It requires protoc 3.19.4, GitHub, and Python. To update the substrait code, simply change the git commit tag in the script to the desired substrait release version. Then, execute the script by running python scripts/update_substrait.py.