Closed kysshsy closed 2 months ago
Hi. I am confused about why we implement a executor hook, rather than FDW join pushdown (or custom scan)
I mean it seems that using FDW (Foreign Data Wrapper) is more compatible with PostgreSQL's existing capabilities. (Maybe it's simpler to implement)
Hi. I am confused about why we implement a executor hook, rather than FDW join pushdown (or custom scan)
- Join with postgres tables (seem we can't do that using executor hook)
- Executor Hook is more complex; we not only need to consider how to push down the query but also need to adapt to PostgreSQL's internal infrastructure.
- currently, there are problems related to Utility Prepare , Execute and explain, which are associated with the Executor Hook.
I mean it seems that using FDW (Foreign Data Wrapper) is more compatible with PostgreSQL's existing capabilities. (Maybe it's simpler to implement)
Good question! It's a performance optimization. The FDW is able to push down predicates, limits, and quals but not other things like built-in aggregates. By using the executor hook we can let DuckDB handle the entire query.
@rebasedming Hi. What do you think of this PR? Duckdb explain format is quite different from Postgres explain. What do you think we should print ? Duckdb's whole physical plan or a "duckdb scan: query" indicates we pushdown the query to Duckdb. And feel free to change it.
@rebasedming Hi. What do you think of this PR? Duckdb explain format is quite different from Postgres explain. What do you think we should print ? Duckdb's whole physical plan or a "duckdb scan: query" indicates we pushdown the query to Duckdb. And feel free to change it.
I like it so far! Seems correct.
If it's possible to get Duckdb to return a query plan that looks somewhat like Postgres' query plans, we can show that to the user. If Duckdb only returns those formatted query plans to the user with the flow chart diagrams, I think we shouldn't return that and it's totally fine to just show the query string, ie
QUERY PLAN
--------------
DuckDB Scan: SELECT COUNT(*) from trips;
One challenge you'll encounter is that not all of Postgres' EXPLAIN options (like VERBOSE) are supported by DuckDB. I think it's valid to just throw an error on those.
Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options.
There are some ideas we could imporve in the future.
Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options.
There are some ideas we could imporve in the future.
- support analyze option add actual run times if explain with this option.
- add style option add a style option, let user to choose the plan style. If user choose duckdb style, we could return Duckdb explain result. (Do you think this is a good idea? In PostgreSQL, the options for the EXPLAIN command are specified as key-value pairs. However, in the sqlparser, these options are implemented as fixed options. To implment this, some modifications to the sqlparser are necessary.)
This is sweet! Heads up @rebasedming is the final authority here and he's very busy right now with some initiatives on pg_search
, so he may be a bit slow to review. Could you please open up GitHub Issues for your suggested next steps from this work? This will help us track the work.
Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options. There are some ideas we could imporve in the future.
- support analyze option add actual run times if explain with this option.
- add style option add a style option, let user to choose the plan style. If user choose duckdb style, we could return Duckdb explain result. (Do you think this is a good idea? In PostgreSQL, the options for the EXPLAIN command are specified as key-value pairs. However, in the sqlparser, these options are implemented as fixed options. To implment this, some modifications to the sqlparser are necessary.)
This is sweet! Heads up @rebasedming is the final authority here and he's very busy right now with some initiatives on
pg_search
, so he may be a bit slow to review. Could you please open up GitHub Issues for your suggested next steps from this work? This will help us track the work.
Ok. When I have some free time, I'll organize the details and open a GitHub Issue to track the next steps. Maybe tomorrow
Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options. There are some ideas we could imporve in the future.
- support analyze option add actual run times if explain with this option.
- add style option add a style option, let user to choose the plan style. If user choose duckdb style, we could return Duckdb explain result. (Do you think this is a good idea? In PostgreSQL, the options for the EXPLAIN command are specified as key-value pairs. However, in the sqlparser, these options are implemented as fixed options. To implment this, some modifications to the sqlparser are necessary.)
This is sweet! Heads up @rebasedming is the final authority here and he's very busy right now with some initiatives on
pg_search
, so he may be a bit slow to review. Could you please open up GitHub Issues for your suggested next steps from this work? This will help us track the work.Ok. When I have some free time, I'll organize the details and open a GitHub Issue to track the next steps. Maybe tomorrow
Thank you! We can then review + merge
I tested this branch against the trips
table from the quickstart and got some weird results.
pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.01 rows=1 width=20)
-> Append (cost=0.00..7.04 rows=7 width=20)
-> Foreign Scan on trips_2023_01 trips_1 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_01 LIMIT 1
-> Foreign Scan on trips_2023_02 trips_2 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_02 LIMIT 1
-> Foreign Scan on trips_2023_03 trips_3 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_03 LIMIT 1
-> Foreign Scan on trips_2023_04 trips_4 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_04 LIMIT 1
-> Foreign Scan on trips_2024_02 trips_5 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2024_02 LIMIT 1
-> Foreign Scan on trips_2024_03 trips_6 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2024_03 LIMIT 1
-> Foreign Scan on trips_2024_04 trips_7 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2024_04 LIMIT 1
(16 rows)
explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
@rebasedming . It looks like a partitioned table's query plan.I am unable to reproduce this.
pg_analytics=# CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
CREATE FOREIGN TABLE
pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
QUERY PLAN
---------------------------------------------------------------------------------
DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1
(1 row)
pg_analytics=# \d+ trips
Foreign table "public.trips"
Column | Type | Collation | Nullable | Default | FDW options | Storage | Stats target | Description
-----------------------+-----------------------------+-----------+----------+---------+-------------+----------+--------------+-------------
vendorid | integer | | | | | plain | |
tpep_pickup_datetime | timestamp without time zone | | | | | plain | |
tpep_dropoff_datetime | timestamp without time zone | | | | | plain | |
passenger_count | bigint | | | | | plain | |
trip_distance | double precision | | | | | plain | |
ratecodeid | bigint | | | | | plain | |
store_and_fwd_flag | character varying | | | | | extended | |
pulocationid | integer | | | | | plain | |
dolocationid | integer | | | | | plain | |
payment_type | bigint | | | | | plain | |
fare_amount | double precision | | | | | plain | |
extra | double precision | | | | | plain | |
mta_tax | double precision | | | | | plain | |
tip_amount | double precision | | | | | plain | |
tolls_amount | double precision | | | | | plain | |
improvement_surcharge | double precision | | | | | plain | |
total_amount | double precision | | | | | plain | |
congestion_surcharge | double precision | | | | | plain | |
airport_fee | double precision | | | | | plain | |
Server: parquet_server
FDW options: (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet')
Could you show me the result of \d+ trips
? And please also provide me with the complete SQL statement to create the table. I am very interested in observing the current behavior of the EXPLAIN for partitioned tables.
I didn't consider the case of partitioned tables before, and I feel that my code shouldn't produce this result. It really is a bit strange.
However, this query plan is actually reasonable, with limits, appends, and separate queries for different partitions. (Although it may not correspond with the current executor pushdown.)
explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
@rebasedming . It looks like a partitioned table's query plan.I am unable to reproduce this.
pg_analytics=# CREATE FOREIGN TABLE trips () SERVER parquet_server OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet'); CREATE FOREIGN TABLE pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1; QUERY PLAN --------------------------------------------------------------------------------- DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1 (1 row) pg_analytics=# \d+ trips Foreign table "public.trips" Column | Type | Collation | Nullable | Default | FDW options | Storage | Stats target | Description -----------------------+-----------------------------+-----------+----------+---------+-------------+----------+--------------+------------- vendorid | integer | | | | | plain | | tpep_pickup_datetime | timestamp without time zone | | | | | plain | | tpep_dropoff_datetime | timestamp without time zone | | | | | plain | | passenger_count | bigint | | | | | plain | | trip_distance | double precision | | | | | plain | | ratecodeid | bigint | | | | | plain | | store_and_fwd_flag | character varying | | | | | extended | | pulocationid | integer | | | | | plain | | dolocationid | integer | | | | | plain | | payment_type | bigint | | | | | plain | | fare_amount | double precision | | | | | plain | | extra | double precision | | | | | plain | | mta_tax | double precision | | | | | plain | | tip_amount | double precision | | | | | plain | | tolls_amount | double precision | | | | | plain | | improvement_surcharge | double precision | | | | | plain | | total_amount | double precision | | | | | plain | | congestion_surcharge | double precision | | | | | plain | | airport_fee | double precision | | | | | plain | | Server: parquet_server FDW options: (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet')
Could you show me the result of
\d+ trips
? And please also provide me with the complete SQL statement to create the table. I am very interested in observing the current behavior of the EXPLAIN for partitioned tables.I didn't consider the case of partitioned tables before, and I feel that my code shouldn't produce this result. It really is a bit strange.
However, this query plan is actually reasonable, with limits, appends, and separate queries for different partitions. (Although it may not correspond with the current executor pushdown.)
You're right, trips
was a partitioned table. I forgot I had it sitting around. Upon closer examination, I do think
explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
@rebasedming . It looks like a partitioned table's query plan.I am unable to reproduce this.
pg_analytics=# CREATE FOREIGN TABLE trips () SERVER parquet_server OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet'); CREATE FOREIGN TABLE pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1; QUERY PLAN --------------------------------------------------------------------------------- DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1 (1 row) pg_analytics=# \d+ trips Foreign table "public.trips" Column | Type | Collation | Nullable | Default | FDW options | Storage | Stats target | Description -----------------------+-----------------------------+-----------+----------+---------+-------------+----------+--------------+------------- vendorid | integer | | | | | plain | | tpep_pickup_datetime | timestamp without time zone | | | | | plain | | tpep_dropoff_datetime | timestamp without time zone | | | | | plain | | passenger_count | bigint | | | | | plain | | trip_distance | double precision | | | | | plain | | ratecodeid | bigint | | | | | plain | | store_and_fwd_flag | character varying | | | | | extended | | pulocationid | integer | | | | | plain | | dolocationid | integer | | | | | plain | | payment_type | bigint | | | | | plain | | fare_amount | double precision | | | | | plain | | extra | double precision | | | | | plain | | mta_tax | double precision | | | | | plain | | tip_amount | double precision | | | | | plain | | tolls_amount | double precision | | | | | plain | | improvement_surcharge | double precision | | | | | plain | | total_amount | double precision | | | | | plain | | congestion_surcharge | double precision | | | | | plain | | airport_fee | double precision | | | | | plain | | Server: parquet_server FDW options: (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet')
Could you show me the result of
\d+ trips
? And please also provide me with the complete SQL statement to create the table. I am very interested in observing the current behavior of the EXPLAIN for partitioned tables.I didn't consider the case of partitioned tables before, and I feel that my code shouldn't produce this result. It really is a bit strange.
However, this query plan is actually reasonable, with limits, appends, and separate queries for different partitions. (Although it may not correspond with the current executor pushdown.)
Woops, you're totally right! I had made the trips
table partitioned and forgot about it.
What I did was I created a normal heap table trips
with year
and month
columns, and made the partitions foreign tables. Something like:
CREATE TABLE public.trips (
vendorid integer,
tpep_pickup_datetime timestamp without time zone,
tpep_dropoff_datetime timestamp without time zone,
passenger_count bigint,
trip_distance double precision,
ratecodeid bigint,
store_and_fwd_flag character varying,
pulocationid integer,
dolocationid integer,
payment_type bigint,
fare_amount double precision,
extra double precision,
mta_tax double precision,
tip_amount double precision,
tolls_amount double precision,
improvement_surcharge double precision,
total_amount double precision,
congestion_surcharge double precision,
airport_fee double precision,
year integer,
month integer
) PARTITION BY LIST (year);
CREATE TABLE public.trips_2023 PARTITION OF public.trips
FOR VALUES IN (2023)
PARTITION BY LIST (month);
CREATE FOREIGN TABLE public.trips_2023_january PARTITION OF public.trips_2023
FOR VALUES IN (1)
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
The output of the EXPLAIN I posted actually looks correct...feel free to ignore this comment.
Ticket(s) Closed
What
Why
How
Tests