tablelandnetwork / weeknotes

A place for weekly updates from the @tablelandnetwork team

[NOT-36] Weeknotes individual update: October 16, 2023 #43

Closed dtbuchholz closed 12 months ago

dtbuchholz commented 12 months ago

Postgres exports using DuckDB

by Avichal Pandey

We mentioned earlier that we are exploring Postgres streaming replication in the Basin project. Here, I want to discuss using DuckDB instead of streaming replication for the same purpose.

DuckDB is an in-process analytical (OLAP) database management system. Like SQLite, it is an embedded database, but it is optimized for analytical workloads and can efficiently run complex queries on large datasets directly from within an application.

DuckDB has a thriving ecosystem of extensions. Here, we will look at two of them: postgres_scanner and parquet. Combining these two extensions, we can export table backups to a Parquet file with only a few lines of code. Here is an example in Rust.

use duckdb::{Connection, Result, Row};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    // Initialize the Parquet extension
    conn.execute_batch("INSTALL parquet; LOAD parquet;")?;

    // Initialize the Postgres extension
    conn.execute_batch("INSTALL postgres_scanner; LOAD postgres_scanner;")?;

    // Scan the Postgres table and copy the result into a Parquet file
    let pg_query = "SELECT * FROM postgres_scan('dbname=mydb user=myuser password=password host=localhost port=5432', 'public', 'mytable')";
    let copy_query = format!("COPY ({}) TO 'fids.parquet' (FORMAT 'parquet');", pg_query);

    // COPY returns a single row containing the number of rows written
    conn.query_row(&copy_query, [], |row: &Row<'_>| {
        println!("rows exported: {:?}", row.get::<_, usize>(0));
        Ok(())
    })?;

    Ok(())
}
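As a quick sanity check (a sketch, assuming the same conn and output file as above), the exported file can be read back through DuckDB's read_parquet table function:

// Read the Parquet file back and count the exported rows (illustrative check)
let count: i64 = conn.query_row(
    "SELECT count(*) FROM read_parquet('fids.parquet')",
    [],
    |row| row.get(0),
)?;
println!("exported {} rows", count);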

At the current stage, Basin copies the data from a table and archives it on Filecoin. We could do the same using these DuckDB extensions. However, there are tradeoffs involved.

First, unlike the log-based replication approach, DuckDB's Postgres scanner performs a full table scan. For large tables, running this operation regularly could be expensive, and that cost is borne by the primary database server.

Second, the DuckDB table scan approach is well suited to simple use cases such as building a cache over a primary database. Building a long-running replication pipeline on top of it would take more work: for instance, it would require maintaining a row pointer indicating where the next table scan should start (a rough sketch of this follows below). It is also likely only practical for tables that are append-only.
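To illustrate the row-pointer idea, here is a minimal sketch, assuming an append-only table with a monotonically increasing id column; the column name, output file naming, and export_since helper are hypothetical and not part of Basin:

// Hypothetical incremental export: only scan rows past the last exported id.
// Assumes `mytable` is append-only with a monotonically increasing `id` column.
fn export_since(conn: &duckdb::Connection, last_id: i64) -> duckdb::Result<i64> {
    let query = format!(
        "COPY (
            SELECT * FROM postgres_scan(
                'dbname=mydb user=myuser password=password host=localhost port=5432',
                'public', 'mytable')
            WHERE id > {}
         ) TO 'delta_{}.parquet' (FORMAT 'parquet');",
        last_id, last_id
    );
    // COPY reports the number of rows written; the caller would persist the new
    // high-water mark (e.g. the MAX(id) of the exported rows) for the next run.
    conn.query_row(&query, [], |row| row.get(0))
}

Note that even with the WHERE clause, the filter may not be pushed down to Postgres, so the scan-cost caveat above still applies.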

From SyncLinear.com | NOT-36