paradedb / pg_analytics

DuckDB-powered analytics for Postgres
https://paradedb.com
PostgreSQL License
197 stars 13 forks source link

feat: Implement caching #30

Closed rebasedming closed 1 month ago

rebasedming commented 2 months ago

Ticket(s) Closed

What

CREATE FOREIGN TABLE now has a cache option. If set to true, a DuckDB table instead of view is created. This stores the table in DuckDB format inside the Postgres data directory.

CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

-- Provide S3 credentials
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

-- Create foreign table with auto schema creation
CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', cache 'true');

-- Success! Now you can query the remote Parquet file like a regular Postgres table
SELECT COUNT(*) FROM trips;

Why

Enable caching for faster queries, sets us up for upserts later on.

How

Every Postgres database now creates a DuckDB database. Connections are opened against this database.

Tests

neilyio commented 2 months ago

Now that the DuckDB database is on disk, how does this work with multiple Postgres connection processes accessing it simultaneously? My impression is that DuckDB needs to be in read-only mode to support multiple connections:

https://github.com/duckdb/duckdb/issues/1343 https://github.com/duckdb/duckdb/issues/40

If you feel like it's appropriate to add a test for this, you can make multiple instances of PgConnection in the test suite and try running queries with both.

rebasedming commented 2 months ago

Now that the DuckDB database is on disk, how does this work with multiple Postgres connection processes accessing it simultaneously? My impression is that DuckDB needs to be in read-only mode to support multiple connections:

duckdb/duckdb#1343 duckdb/duckdb#40

If you feel like it's appropriate to add a test for this, you can make multiple instances of PgConnection in the test suite and try running queries with both.

Great catch. I had missed this caveat but I now see it https://duckdb.org/docs/connect/concurrency.

Do you think a background worker would solve?