Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do it.
Hi @rebasedming,
I’ve put together a detailed README for the PR, which you can find here.
I aimed to be thorough in capturing all the necessary details. If you find it too lengthy or in need of restructuring, please let me know, and I’ll make the necessary adjustments.
Nice, thank you for the extremely thorough writeup.
What I was most interested in is your approach of putting partitioned heap tables in front of foreign tables to pass partition keys to the Parquet file string. I wasn't aware this was possible and was hoping you could elaborate on that.
Thanks for pointing out the missing critical piece. I've introduced the section Partitioned Table Structure and S3 Integration with the necessary details, and I've also created a TL;DR quick overview.
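For quick reference, here is a minimal sketch of the pattern, assuming the parquet_wrapper FDW and the files option from the pg_analytics README; the table, column, and bucket names are illustrative rather than taken from the fixture:

```sql
-- Foreign server backed by pg_analytics' Parquet wrapper
CREATE FOREIGN DATA WRAPPER parquet_wrapper
    HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

-- Top-level heap table, partitioned on the first key (year)
CREATE TABLE auto_sales (
    sale_id    BIGINT,
    sale_year  INT,
    sale_month INT,
    amount     NUMERIC
) PARTITION BY LIST (sale_year);

-- Second level: an ordinary partitioned table per year
CREATE TABLE auto_sales_2023 PARTITION OF auto_sales
    FOR VALUES IN (2023)
    PARTITION BY LIST (sale_month);

-- Leaves: foreign tables whose Parquet path embeds both partition keys,
-- so each leaf maps to exactly one object prefix in S3
CREATE FOREIGN TABLE auto_sales_2023_01 PARTITION OF auto_sales_2023
    FOR VALUES IN (1)
    SERVER parquet_server
    OPTIONS (files 's3://demo-bucket/auto_sales/year=2023/month=1/data.parquet');
```

Queries against auto_sales are then routed by ordinary Postgres partition pruning to the foreign table whose S3 path matches the partition keys.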
Hi @rebasedming,
I've made some changes to address the clippy warnings in this PR:
- Created a new crate: pg_analytics_test_helpers
- Moved the fixtures and common modules from the tests folder to this new crate
These changes were necessary to get the CI build passing. I know this might be outside the scope of our current PR, so let me know if you'd prefer I revert these changes.
Thanks for your input on this.
Can you explain which Clippy warnings you needed a new crate for? I don't think we should complicate the project with an extra crate. We can simply ignore some clippy warnings in the test files if they're causing issues.
I also took a look at your CI error, and I'm a bit confused by it. I suspect this has to do with the newly introduced crate. If you keep a single crate, it shouldn't find two different PG versions.
Hi @philippemnoel,
I've identified an issue with commit 9491b824e5dfffd3d636dcb7bb4b67a3fe9ee858:
- cargo test runs successfully
- cargo clippy --all-targets fails
The problem:
- Error occurs in: use crate::common::{execute_query, fetch_results, print_utils};
- Clippy searches for the common module in the pg_analytics crate instead of the tests folder
I tried several solutions (e.g., using super) to fix the module path, but no luck. As a temporary fix, I've:
- Moved the fixture and common modules out of the tests folder
- Created a new crate: pg_analytics_test_helpers
This code is added by your PR, right? Can we instead fix the import path and keep things in the same crate?
Hi @philippemnoel,
The PR is ready for the next level of review. I've removed the duplicate code, performed a cleanup, and ensured proper integration with PR #91. Additionally, fixtures/tables/auto_sales.rs now follows the pattern used in other test implementations across the workspace.
Hi @shamb0. This looks super clean! Thank you for integrating everything properly, I'm very excited about this PR.
I believe it should also have documentation, so that users know partitions are supported and how to use them. Our documentation is stored in https://github.com/paradedb/paradedb/tree/dev/docs. Would you be willing to submit a PR with documentation to that repository as well? Then I think everything will be complete :). I'll let Ming do a more thorough review.
Hi @philippemnoel,
I wanted to update you on PR#1568, which includes recent documentation changes.
Currently, I’ve placed the new topic under ingest/configuration/multi-level-partitioned-tables. Could you please review and suggest if this is the most appropriate location, or if you have any preferred path for this topic?
Sorry, we had to merge a few other PRs to get v0.1.1 out. There shouldn't be anything else that introduces a conflict, but could you please rebase it? We'll prioritize it :)
Hi @philippemnoel,
The rebase is complete, and the PR should now be ready for intake review. Please let me know if you encounter any issues.
Thanks again!
Thank you! Could you please take a look at the failing test?
Hi @shamb0 , I've spent some time testing this PR and I have some bad news.
While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.
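Concretely, reusing the illustrative names from the sketch earlier in the thread (the behavior is as described above, not re-measured here):

```sql
-- Aggregating through the partitioned heap table: the executor hook is not
-- run, so each pruned leaf is read via the Postgres FDW scan path.
SELECT sale_month, SUM(amount)
FROM auto_sales
WHERE sale_year = 2023
GROUP BY sale_month;

-- Querying the pg_analytics foreign table directly is the case that gets
-- pushed down to DuckDB wholesale via the executor hook.
SELECT SUM(amount)
FROM auto_sales_2023_01;
```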
Hi @rebasedming,
Thank you for the insightful feedback and for highlighting the performance issue with the strategy used in the PR.
Based on your comments, I understand that the current approach of using partitioned heap tables in front of foreign tables bypasses the executor hook, which normally pushes queries to DuckDB. This results in slower query execution through PostgreSQL's FDW API, as it ends up performing a sequential scan of the underlying Parquet files.
I will investigate this issue further and work on an improved strategy that maintains the performance benefits of DuckDB. I’ll get back to you soon with a better solution.
Thanks for your patience!
We're excited to see the next iteration :)
Closing this now as per the above discussion.
Closes #56
What
Implements a demonstration test for multi-level partition tables, addressing issue #56.
Why
This demo showcases the pg_analytics extension's capability to support multi-level partitioned tables. The implementation organizes data hierarchically, enabling efficient access to context-relevant information.
How
Sets up multi-level partitioned tables backed by the pg_analytics Foreign Data Wrapper (FDW) in PostgreSQL using S3 data.
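The S3 side of the setup looks roughly like the sketch below; the user mapping option names are assumptions based on the pg_analytics documentation and may differ from what the test fixture actually uses:

```sql
-- Credentials for the foreign server (option names are assumptions; check
-- the pg_analytics docs for the authoritative list).
CREATE USER MAPPING FOR public SERVER parquet_server
    OPTIONS (
        type   'S3',
        key_id 'test-access-key',
        secret 'test-secret-key',
        region 'us-east-1'
    );
```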
Tests
To run the demonstration test:
Test traces are available in the attached log file: https://gist.githubusercontent.com/shamb0/2ed909ac9604c610af1d7fa0e87f9a82/raw/02a4203cdc1d675181d9f9700c578c81405becdb/wk2434-pg_analytics-mlp-demo.md