paradedb / pg_analytics

DuckDB-powered analytics for Postgres
https://paradedb.com
PostgreSQL License

test: Support for Multi-Level Partition Tables #88

Closed shamb0 closed 2 months ago

shamb0 commented 3 months ago

Closes #56

What

Implements a demonstration test for multi-level partition tables, addressing issue #56.

Why

This demo showcases the pg_analytics extension's capability to support multi-level partitioned tables. The implementation organizes the data hierarchically, so queries can efficiently access only the partitions relevant to them.

How

  1. Generates a simulated Auto Sales dataset and creates a local Parquet data source.
  2. Utilizes DataFusion to load sales data from the Parquet file into a DataFrame.
  3. Partitions the data and uploads partitions to an S3 bucket.
  4. Configures the necessary tables and the pg_analytics Foreign Data Wrapper (FDW) in PostgreSQL over the S3 data (a rough SQL sketch follows this list).
  5. Executes analytics test queries on the multi-level partitioned table.
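
For step 4, the foreign data wrapper setup follows the standard pg_analytics pattern (`parquet_fdw_handler` / `parquet_fdw_validator` and a per-table `files` option). The sketch below is illustrative only; the server, table, column, and bucket names are placeholders, not the ones used in the test:

```sql
-- Illustrative only: names, columns, and the S3 path are placeholders,
-- not the test's actual schema.
CREATE EXTENSION IF NOT EXISTS pg_analytics;

CREATE FOREIGN DATA WRAPPER parquet_wrapper
    HANDLER parquet_fdw_handler
    VALIDATOR parquet_fdw_validator;

CREATE SERVER auto_sales_server
    FOREIGN DATA WRAPPER parquet_wrapper;

-- One foreign table per leaf partition, pointing at the matching S3 prefix.
CREATE FOREIGN TABLE auto_sales_2021_q1 (
    sale_id   BIGINT,
    sale_date DATE,
    year      INT,
    quarter   INT,
    model     TEXT,
    price     DOUBLE PRECISION
)
SERVER auto_sales_server
OPTIONS (files 's3://demo-bucket/auto_sales/year=2021/quarter=1/*.parquet');
```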

Tests

To run the demonstration test:

RUST_LOG=info cargo test --test test_mlp_auto_sales -- --nocapture

Test traces are available in this log file: https://gist.githubusercontent.com/shamb0/2ed909ac9604c610af1d7fa0e87f9a82/raw/02a4203cdc1d675181d9f9700c578c81405becdb/wk2434-pg_analytics-mlp-demo.md

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.

shamb0 commented 3 months ago

> Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do it.

Hi @rebasedming,

I’ve put together a detailed README for the PR, which you can find here.

I aimed to be thorough in capturing all the necessary details. If you find it too lengthy or in need of restructuring, please let me know, and I’ll make the necessary adjustments.

rebasedming commented 3 months ago

Nice, thank you for the extremely thorough writeup.

What I was most interested in is your approach of putting partitioned heap tables in front of foreign tables to pass partition keys through to the Parquet file paths. I wasn't aware this was possible and was hoping you could elaborate on that.

shamb0 commented 3 months ago

Thanks for pointing out the missing critical piece. I've introduced the section Partitioned Table Structure and S3 Integration with the necessary details, and I've also created a TL;DR quick overview.
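
For readers skimming the thread, the core idea of that section is roughly this: a regular Postgres partitioned table sits in front, and each leaf partition is a pg_analytics foreign table whose `files` option points at the S3 prefix for that partition key. A minimal sketch, reusing the hypothetical `auto_sales_2021_q1` foreign table from the earlier snippet:

```sql
-- Hypothetical sketch: the heap parent is partitioned on the same keys as
-- the Parquet directory layout (year, then quarter).
CREATE TABLE auto_sales (
    sale_id   BIGINT,
    sale_date DATE,
    year      INT,
    quarter   INT,
    model     TEXT,
    price     DOUBLE PRECISION
) PARTITION BY LIST (year);

-- Second level: one sub-partitioned table per year.
CREATE TABLE auto_sales_2021 PARTITION OF auto_sales
    FOR VALUES IN (2021)
    PARTITION BY LIST (quarter);

-- Leaf level: attach the foreign table (backed by
-- s3://demo-bucket/auto_sales/year=2021/quarter=1/) as a partition.
ALTER TABLE auto_sales_2021
    ATTACH PARTITION auto_sales_2021_q1 FOR VALUES IN (1);
```

With this layout, Postgres partition pruning routes a query such as `SELECT ... FROM auto_sales WHERE year = 2021 AND quarter = 1` to the single foreign table, which in turn reads only the matching S3 prefix.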

shamb0 commented 3 months ago

Hi @rebasedming,

I've made some changes to address the clippy warnings in this PR:

  1. Created a new crate: pg_analytics_test_helpers
  2. Moved fixtures and common modules from the tests folder to this new crate

These changes were necessary to get the CI build passing. I know this might be outside the scope of our current PR, so let me know if you'd prefer I revert these changes.

Thanks for your input on this.

philippemnoel commented 3 months ago

Can you explain which Clippy warnings you needed a new crate for? I don't think we should complicate the project with an extra crate. We can simply ignore some clippy warnings in the test files if they're causing issues.

philippemnoel commented 3 months ago

I also took a look at your CI error, and I'm a bit confused by it. I suspect this has to do with the newly introduced crate. If you keep a single crate, it shouldn't find two different PG versions.

shamb0 commented 3 months ago

Hi @philippemnoel,

I've identified an issue with commit 9491b824e5dfffd3d636dcb7bb4b67a3fe9ee858:

  1. cargo test runs successfully
  2. cargo clippy --all-targets fails

The problem:

  • The error occurs on: use crate::common::{execute_query, fetch_results, print_utils};
  • Clippy searches for the common module in the pg_analytics crate instead of the tests folder

I tried several solutions (e.g., using super) to fix the module path, but no luck. As a temporary fix, I've:

  • Moved the fixtures and common modules out of the tests folder
  • Created a new crate: pg_analytics_test_helpers

philippemnoel commented 3 months ago

This code is added by your PR, right? Can we instead fix the import path and keep things in the same crate?

shamb0 commented 2 months ago

Hi @philippemnoel,

The PR is ready for the next level of review. I've removed the duplicate code, performed a cleanup, and ensured proper integration with PR #91. Additionally, fixtures/tables/auto_sales.rs now follows the pattern used in other test implementations across the workspace.

philippemnoel commented 2 months ago

Hi @shamb0. This looks super clean! Thank you for integrating everything properly, I'm very excited about this PR.

I believe it should also have documentation, so that users know partitions are supported and how to use them. Our documentation is stored in https://github.com/paradedb/paradedb/tree/dev/docs. Would you be willing to submit a PR with documentation to that repository as well? Then I think everything will be complete :). I'll let Ming do a more thorough review.

shamb0 commented 2 months ago

Hi @philippemnoel,

I wanted to update you on PR #1568, which includes the recent documentation changes.

Currently, I've placed the new topic under ingest/configuration/multi-level-partitioned-tables. Could you review it and let me know whether this is the most appropriate location, or whether you'd prefer a different path for this topic?


philippemnoel commented 2 months ago

Sorry, we had to merge a few other PRs to get v0.1.1 out. There shouldn't be anything else that introduces a conflict, but could you please rebase it? We'll prioritize it :)

shamb0 commented 2 months ago

Hi @philippemnoel,

The rebase is complete, and the PR should now be ready for intake review. Please let me know if you encounter any issues.

Thanks again!

philippemnoel commented 2 months ago

Thank you! Could you please take a look at the failing test?

rebasedming commented 2 months ago

Hi @shamb0, I've spent some time testing this PR and I have some bad news.

While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.
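
To make the two code paths concrete, here is a sketch using the placeholder tables from the earlier snippets (not actual test queries):

```sql
-- Querying the foreign table directly: the query is intercepted by the
-- executor hook and pushed down to DuckDB in its entirety.
SELECT model, avg(price)
FROM auto_sales_2021_q1
GROUP BY model;

-- Querying through the partitioned heap parent: Postgres plans the query,
-- prunes to the matching foreign partition, and reads it via the FDW API
-- (a scan of the Parquet file with only predicate/limit pushdown).
SELECT model, avg(price)
FROM auto_sales
WHERE year = 2021 AND quarter = 1
GROUP BY model;
```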

shamb0 commented 2 months ago

Hi @rebasedming,

Thank you for the insightful feedback and for highlighting the performance issue with the strategy used in the PR.

Based on your comments, I understand that the current approach of using partitioned heap tables in front of foreign tables bypasses the executor hook, which normally pushes queries to DuckDB. This results in slower query execution through PostgreSQL's FDW API, as it ends up performing a sequential scan of the underlying Parquet files.

I will investigate this issue further and work on an improved strategy that maintains the performance benefits of DuckDB. I’ll get back to you soon with a better solution.

Thanks for your patience!

philippemnoel commented 2 months ago

We're excited to see the next iteration :)

rebasedming commented 2 months ago

Closing this now as per the discussion above.