kestra-io / plugin-jdbc

https://kestra.io/plugins/
Apache License 2.0
10 stars 7 forks source link

DuckDB - Access or create a persistent database #159

Open armetiz opened 11 months ago

armetiz commented 11 months ago

Feature description

By default, DuckDB start with an in-memory database.

To avoid out-of-memory, it could be useful to connect DuckDB to a database file.

From Java documentation :

When using the jdbc:duckdb: URL alone, an in-memory database is created. Note that for an in-memory database no data is persisted to disk (i.e., all data is lost when you exit the Java program). If you would like to access or create a persistent database, append its file name after the path. For example, if your database is stored in /tmp/my_database, use the JDBC URL jdbc:duckdb:/tmp/my_database to create a connection to it.

armetiz commented 11 months ago

On MBP, using DuckDB 0.9.0 with an in-memory database, I tried to fetch a large data-set.

Here the DuckDB error :

Error: near line 1: Out of Memory Error: could not allocate block of size 262KB (27.4GB/27.4GB used)
Database is launched in in-memory mode and no temporary directory is specified.
Unused blocks cannot be offloaded to disk.

Launch the database with a persistent storage back-end
Or set PRAGMA temp_directory='/path/to/tmp.tmp'

IMHO, Using an in-memory database with setting temp_directory is adapted to a stateless task. This should be the case by default.

Whereas using DuckDB with a persistent storage back-end could be useful only if it could be "re-used" between tasks. This should be a Kestra option.

I mean something like that. Tasks :

  1. Create and import SQL table - echo "CREATE TABLE t1 AS SELECT 42 AS i, 84 AS j;" | duckdb database.file
  2. Export analyze - echo "COPY t1 TO 'output.parquet' (FORMAT PARQUET)" | duckdb database.file

It could be useful because SQL operations could be split between dedicated task, to improve debug, maintenance, readability ...