paradedb / pg_analytics

DuckDB-powered analytics for Postgres
https://paradedb.com
GNU Affero General Public License v3.0
139 stars 9 forks source link

Cache Parquet metadata in `pg_analytics` #57

Open philippemnoel opened 1 month ago

philippemnoel commented 1 month ago

What feature are you requesting?

DuckDB has a feature, enable_object_cache, which can cache Parquet metadata in-memory. This is known to improve query performance. We would like to cache this on-disk, so that it can persist across queries.

Why are you requesting this feature?

See above ^

What is your proposed implementation for this feature?

I'm not sure, see https://github.com/paradedb/pg_analytics/pull/30 for reference

Full Name:

Philippe Noël

Affiliation:

ParadeDB

dharanad commented 1 month ago

interesting. I would like to work on this issue. Before that let me take a look at the codebase.

philippemnoel commented 1 month ago

It’s yours!

dharanad commented 1 month ago

Hey @philippemnoel I am finding it hard to allocate time for this issue. Since this is marked as high priority. Please feel free to re-assign this issues to someone else

philippemnoel commented 1 month ago

Hey @philippemnoel I am finding it hard to allocate time for this issue. Since this is marked as high priority. Please feel free to re-assign this issues to someone else

Sounds good!

shamb0 commented 1 month ago

Hi @philippemnoel, I am interested to work on this. Could you please assign it to me.

philippemnoel commented 1 month ago

Yes

philippemnoel commented 1 month ago

Hi @philippemnoel, I am interested to work on this. Could you please assign it to me.

Done! Excited to see :)

shamb0 commented 3 weeks ago

Hi @philippemnoel,

Could you please reassign the issue to me?

2024-09-02 22:27:30.592  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Completed data upload to S3
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: table "auto_sales" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: foreign-data wrapper "parquet_wrapper" does not exist, skipping
2024-09-02 22:27:30.594  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache disabled: 479.805658ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache enabled: 272.384816ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time after disabling cache: 476.185084ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Performance improvement with cache: 42.80%
test test_duckdb_object_cache_performance ... ok

Thank you!

philippemnoel commented 3 weeks ago

Hi @philippemnoel,

Could you please reassign the issue to me?

  • I've completed a draft version, but this implementation depends on PR#115, so I’m waiting for it to be reviewed and merged before proceeding further.
  • You can find the draft prototype here: link to draft prototype.
  • Please kindly find the traces for reference.
2024-09-02 22:27:30.592  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Completed data upload to S3
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: table "auto_sales" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:30.593  INFO sqlx::postgres::notice: foreign-data wrapper "parquet_wrapper" does not exist, skipping
2024-09-02 22:27:30.594  INFO sqlx::postgres::notice: server "auto_sales_server" does not exist, skipping
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache disabled: 479.805658ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time with cache enabled: 272.384816ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Average execution time after disabling cache: 476.185084ms
2024-09-02 22:27:43.928  INFO test_mlp_auto_sales::fixtures::tables::auto_sales: Performance improvement with cache: 42.80%
test test_duckdb_object_cache_performance ... ok

Thank you!

Done! Really exciting stuff, I had removed you since you were still working on the partition PR. Excited to see it!