Hi @rebasedming,
I'd like to provide an update on the progress of the investigation:
- I have refactored the execution hook to handle multi-level partition tables, enabling query pushdown to DuckDB.
- I completed a sanity test using the existing auto sales dataset, and everything is working as expected.
Could you please provide feedback on the approach used for handling multi-level partition table queries in the execution hook? Additionally, I would appreciate any suggestions on how it can be further refined.
Hi @shamb0 -
I admire your perseverance chasing down this issue! It's a tricky one.
I know this is just a draft, but I don't feel good about this implementation. Intercepting and rewriting the query feels pretty unsafe and introduces a lot of complexity to the code base. I'm open to being convinced but I think this is way too much technical overhead for just this one feature.
Hi @rebasedming,
Thank you for your thorough review and candid feedback on the PR code patch. I truly appreciate the time you've taken and your openness to further discussion.
Regarding your concerns:
Safety and Complexity: Query interception is already part of the executor hook in pg_analytics, so it's not a new pattern in our system.
Current State and Future Plans:
Moving Forward:
I'm looking forward to our continued collaboration on this.
Hi @rebasedming,
Firstly, I'd like to acknowledge that my earlier patches may have been overly complex. Thank you for your patience as I worked through this.
I have some good news to share! Today, I discovered the hive_partitioning option in ParquetOption, which I had previously overlooked. After some experimentation, I found that it leads to a much simpler solution than the one I initially proposed.
Key point: this simpler approach doesn't require any complex query interception or SQL statement remapping in the executor hook.
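To illustrate the effect, here is the sanity-check query from the trace attached further down; year and manufacturer are partition columns derived from the year=*/manufacturer=* directory layout, so this is a sketch of the behavior rather than new test code:

    -- Partition columns behave like ordinary columns; the filter on year
    -- should let DuckDB prune whole year=* directories before scanning.
    SELECT year, manufacturer, ROUND(SUM(price)::numeric, 4)::float8 AS total_sales
    FROM auto_sales
    WHERE year BETWEEN 2020 AND 2024
    GROUP BY year, manufacturer
    ORDER BY year, total_sales DESC;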
Next steps: I'll be cleaning up the code based on this new approach. I expect to have the updated patch ready by tomorrow morning.
For your reference, I've attached a snapshot of the PostgreSQL server trace.
Thank you for your guidance throughout this process. I appreciate your support.
2024-08-30 22:01:28.251 IST [590249] STATEMENT: drop database if exists "_sqlx_test_640";
2024-08-30 22:01:33.544 IST [590248] LOG: statement:
DROP TABLE IF EXISTS auto_sales CASCADE;
2024-08-30 22:01:33.544 IST [590248] NOTICE: table "auto_sales" does not exist, skipping
2024-08-30 22:01:33.544 IST [590248] LOG: statement:
DROP SERVER IF EXISTS auto_sales_server CASCADE;
2024-08-30 22:01:33.544 IST [590248] NOTICE: server "auto_sales_server" does not exist, skipping
2024-08-30 22:01:33.544 IST [590248] LOG: statement:
DROP FOREIGN DATA WRAPPER IF EXISTS parquet_wrapper CASCADE;
2024-08-30 22:01:33.544 IST [590248] NOTICE: foreign-data wrapper "parquet_wrapper" does not exist, skipping
2024-08-30 22:01:33.545 IST [590248] LOG: statement:
DROP USER MAPPING IF EXISTS FOR public SERVER auto_sales_server;
2024-08-30 22:01:33.545 IST [590248] NOTICE: server "auto_sales_server" does not exist, skipping
2024-08-30 22:01:33.545 IST [590248] LOG: statement: CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler
VALIDATOR parquet_fdw_validator
2024-08-30 22:01:33.545 IST [590248] WARNING: pga:: *** ParquetFdw::validator() X ***
2024-08-30 22:01:33.545 IST [590248] WARNING: pga:: *** ParquetFdw::validator() Y ***
2024-08-30 22:01:33.551 IST [590248] LOG: statement: CREATE SERVER auto_sales_server
FOREIGN DATA WRAPPER parquet_wrapper
2024-08-30 22:01:33.551 IST [590248] WARNING: pga:: *** ParquetFdw::validator() X ***
2024-08-30 22:01:33.551 IST [590248] WARNING: pga:: *** ParquetFdw::validator() Y ***
2024-08-30 22:01:33.553 IST [590248] LOG: statement: CREATE USER MAPPING FOR public
SERVER auto_sales_server
OPTIONS (
type 'S3',
region 'us-east-1',
endpoint 'localhost:33182',
use_ssl 'false',
url_style 'path'
)
2024-08-30 22:01:33.553 IST [590248] WARNING: pga:: *** ParquetFdw::validator() X ***
2024-08-30 22:01:33.553 IST [590248] WARNING: pga:: *** ParquetFdw::validator() Y ***
2024-08-30 22:01:33.556 IST [590248] LOG: statement:
CREATE FOREIGN TABLE auto_sales (
sale_id BIGINT,
sale_date DATE,
manufacturer TEXT,
model TEXT,
price DOUBLE PRECISION,
dealership_id INT,
customer_id INT,
year INT,
month INT
)
SERVER auto_sales_server
OPTIONS (
files 's3://demo-mlp-auto-sales/year=*/manufacturer=*/data_*.parquet',
hive_partitioning '1'
);
...
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** fdw::register_duckdb_view() Y ***
2024-08-30 22:01:33.606 IST [590248] LOG: execute sqlx_s_2:
SELECT year, manufacturer, ROUND(SUM(price)::numeric, 4)::float8 as total_sales
FROM auto_sales
WHERE year BETWEEN 2020 AND 2024
GROUP BY year, manufacturer
ORDER BY year, total_sales DESC;
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** executor_run() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** ExtensionHook::executor_run() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** get_current_query() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** get_current_query() "\n SELECT year, manufacturer, ROUND(SUM(price)::numeric, 4)::float8 as total_sales\n FROM auto_sales\n WHERE year BETWEEN 2020 AND 2024\n GROUP BY year, manufacturer\n ORDER BY year, total_sales DESC" ***
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** get_query_relations() X ***
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** get_query_relations() 1 ***
2024-08-30 22:01:33.606 IST [590248] WARNING: query_relations.is_empty() :: false
2024-08-30 22:01:33.606 IST [590248] WARNING: pga:: *** duckdb::create_arrow() "\n SELECT year, manufacturer, ROUND(SUM(price)::numeric, 4)::float8 as total_sales\n FROM auto_sales\n WHERE year BETWEEN 2020 AND 2024\n GROUP BY year, manufacturer\n ORDER BY year, total_sales DESC" ***
2024-08-30 22:01:34.091 IST [590248] WARNING: pga:: *** duckdb::create_arrow() Y ***
2024-08-30 22:01:34.091 IST [590248] WARNING: pga:: *** ExtensionHook::write_batches_to_slots() X 4 ***
2024-08-30 22:01:34.091 IST [590248] WARNING: pga:: *** ExtensionHook::write_batches_to_slots() Y ***
2024-08-30 22:01:34.091 IST [590248] WARNING: pga:: *** ExtensionHook::executor_run() Y ***
2024-08-30 22:01:34.091 IST [590248] WARNING: pga:: *** executor_run() Y ***
2024-08-30 22:06:23.245 IST [61966] LOG: checkpoint starting: time
2024-08-30 22:07:58.164 IST [61966] LOG: checkpoint complete: wrote 948 buffers (5.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=94.872 s, sync=0.024 s, total=94.920 s; sync files=304, longest=0.006 s, average=0.001 s; distance=4300 kB, estimate=4306 kB; lsn=0/955A6278, redo lsn=0/955A6240
Hi @philippemnoel,
I have resolved the code conflicts, and the branch should now be ready for intake review and potential merge. :)
Thank you! I’m personally very excited about this iteration; it feels much better. We’re pretty busy with a few big features right now, but we’ll review ASAP, and I’m hopeful we can get this merged next week. @rebasedming is the final authority on any merge, so whenever he has some time he’ll take a look!
@Weijun-H, thanks for your feedback! I’ve addressed all your comments. Let me know if there’s anything else to improve.
Hi @philippemnoel,
Thank you for the review comments, I really appreciate it. Moving forward, I'll aim for smaller PRs with minimal changes to make them easier to review.
Closes #56
What
Implements a demonstration test for multi-level partition tables, addressing issue #56.
Why
This demonstration showcases the pg_analytics extension's ability to support multi-level partitioned tables. By using Hive-style partitioning for organizing data hierarchically, it enables efficient access to context-specific information, enhancing query performance and scalability.
How
The implementation involves two key components:
1. Hive-style Partitioning in S3: the dataset is organized in S3 under Hive-style partition prefixes (year=*/manufacturer=*).
2. FOREIGN TABLE Configuration in pg_analytics: a FOREIGN TABLE is created with options to access the Hive-style partitioned dataset, as sketched below.
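The following DDL is reconstructed from the server trace earlier in this thread (the bucket name, column list, and options all come from that trace), so treat it as illustrative rather than the final test code:

    CREATE FOREIGN TABLE auto_sales (
        sale_id        BIGINT,
        sale_date      DATE,
        manufacturer   TEXT,             -- Hive partition column (manufacturer=*)
        model          TEXT,
        price          DOUBLE PRECISION,
        dealership_id  INT,
        customer_id    INT,
        year           INT,              -- Hive partition column (year=*)
        month          INT
    )
    SERVER auto_sales_server
    OPTIONS (
        files 's3://demo-mlp-auto-sales/year=*/manufacturer=*/data_*.parquet',
        hive_partitioning '1'
    );

With hive_partitioning enabled, the year and manufacturer columns are populated from the directory names in the S3 path rather than from the Parquet files themselves.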
Tests
To run the demonstration test, use the following command: