paradedb / pg_analytics

DuckDB-powered analytics for Postgres
https://paradedb.com
PostgreSQL License
389 stars 15 forks source link

feat: implement basic `EXPLAIN` statement #64

Closed kysshsy closed 2 months ago

kysshsy commented 3 months ago

Ticket(s) Closed

What

Why

How

Tests

kysshsy commented 3 months ago

Hi. I am confused about why we implement a executor hook, rather than FDW join pushdown (or custom scan)

  1. Join with postgres tables (seem we can't do that using executor hook)
  2. Executor Hook is more complex; we not only need to consider how to push down the query but also need to adapt to PostgreSQL's internal infrastructure.
  3. currently, there are problems related to Utility Prepare , Execute and explain, which are associated with the Executor Hook.

I mean it seems that using FDW (Foreign Data Wrapper) is more compatible with PostgreSQL's existing capabilities. (Maybe it's simpler to implement)

rebasedming commented 3 months ago

Hi. I am confused about why we implement a executor hook, rather than FDW join pushdown (or custom scan)

  1. Join with postgres tables (seem we can't do that using executor hook)
  2. Executor Hook is more complex; we not only need to consider how to push down the query but also need to adapt to PostgreSQL's internal infrastructure.
  3. currently, there are problems related to Utility Prepare , Execute and explain, which are associated with the Executor Hook.

I mean it seems that using FDW (Foreign Data Wrapper) is more compatible with PostgreSQL's existing capabilities. (Maybe it's simpler to implement)

Good question! It's a performance optimization. The FDW is able to push down predicates, limits, and quals but not other things like built-in aggregates. By using the executor hook we can let DuckDB handle the entire query.

kysshsy commented 3 months ago

@rebasedming Hi. What do you think of this PR? Duckdb explain format is quite different from Postgres explain. What do you think we should print ? Duckdb's whole physical plan or a "duckdb scan: query" indicates we pushdown the query to Duckdb. And feel free to change it.

rebasedming commented 3 months ago

@rebasedming Hi. What do you think of this PR? Duckdb explain format is quite different from Postgres explain. What do you think we should print ? Duckdb's whole physical plan or a "duckdb scan: query" indicates we pushdown the query to Duckdb. And feel free to change it.

I like it so far! Seems correct.

If it's possible to get Duckdb to return a query plan that looks somewhat like Postgres' query plans, we can show that to the user. If Duckdb only returns those formatted query plans to the user with the flow chart diagrams, I think we shouldn't return that and it's totally fine to just show the query string, ie

QUERY PLAN
--------------
DuckDB Scan: SELECT COUNT(*) from trips;

One challenge you'll encounter is that not all of Postgres' EXPLAIN options (like VERBOSE) are supported by DuckDB. I think it's valid to just throw an error on those.

kysshsy commented 2 months ago

Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options.

There are some ideas we could imporve in the future.

  1. support analyze option add actual run times if explain with this option.
  2. add style option add a style option, let user to choose the plan style. If user choose duckdb style, we could return Duckdb explain result. (Do you think this is a good idea? In PostgreSQL, the options for the EXPLAIN command are specified as key-value pairs. However, in the sqlparser, these options are implemented as fixed options. To implment this, some modifications to the sqlparser are necessary.)
philippemnoel commented 2 months ago

Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options.

There are some ideas we could imporve in the future.

  1. support analyze option add actual run times if explain with this option.
  2. add style option add a style option, let user to choose the plan style. If user choose duckdb style, we could return Duckdb explain result. (Do you think this is a good idea? In PostgreSQL, the options for the EXPLAIN command are specified as key-value pairs. However, in the sqlparser, these options are implemented as fixed options. To implment this, some modifications to the sqlparser are necessary.)

This is sweet! Heads up @rebasedming is the final authority here and he's very busy right now with some initiatives on pg_search, so he may be a bit slow to review. Could you please open up GitHub Issues for your suggested next steps from this work? This will help us track the work.

kysshsy commented 2 months ago

Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options. There are some ideas we could imporve in the future.

  1. support analyze option add actual run times if explain with this option.
  2. add style option add a style option, let user to choose the plan style. If user choose duckdb style, we could return Duckdb explain result. (Do you think this is a good idea? In PostgreSQL, the options for the EXPLAIN command are specified as key-value pairs. However, in the sqlparser, these options are implemented as fixed options. To implment this, some modifications to the sqlparser are necessary.)

This is sweet! Heads up @rebasedming is the final authority here and he's very busy right now with some initiatives on pg_search, so he may be a bit slow to review. Could you please open up GitHub Issues for your suggested next steps from this work? This will help us track the work.

Ok. When I have some free time, I'll organize the details and open a GitHub Issue to track the next steps. Maybe tomorrow

philippemnoel commented 2 months ago

Hi, this PR is ready for view. Since now we don't have a good way to convert Duckdb query plan to postgres style. Now it's show that we pushdown the query to duckdb. And it will report error when explai with options. There are some ideas we could imporve in the future.

  1. support analyze option add actual run times if explain with this option.
  2. add style option add a style option, let user to choose the plan style. If user choose duckdb style, we could return Duckdb explain result. (Do you think this is a good idea? In PostgreSQL, the options for the EXPLAIN command are specified as key-value pairs. However, in the sqlparser, these options are implemented as fixed options. To implment this, some modifications to the sqlparser are necessary.)

This is sweet! Heads up @rebasedming is the final authority here and he's very busy right now with some initiatives on pg_search, so he may be a bit slow to review. Could you please open up GitHub Issues for your suggested next steps from this work? This will help us track the work.

Ok. When I have some free time, I'll organize the details and open a GitHub Issue to track the next steps. Maybe tomorrow

Thank you! We can then review + merge

rebasedming commented 2 months ago

I tested this branch against the trips table from the quickstart and got some weird results.

pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.01 rows=1 width=20)
   ->  Append  (cost=0.00..7.04 rows=7 width=20)
         ->  Foreign Scan on trips_2023_01 trips_1  (cost=0.00..1.00 rows=1 width=0)
               DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_01 LIMIT 1
         ->  Foreign Scan on trips_2023_02 trips_2  (cost=0.00..1.00 rows=1 width=0)
               DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_02 LIMIT 1
         ->  Foreign Scan on trips_2023_03 trips_3  (cost=0.00..1.00 rows=1 width=0)
               DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_03 LIMIT 1
         ->  Foreign Scan on trips_2023_04 trips_4  (cost=0.00..1.00 rows=1 width=0)
               DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2023_04 LIMIT 1
         ->  Foreign Scan on trips_2024_02 trips_5  (cost=0.00..1.00 rows=1 width=0)
               DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2024_02 LIMIT 1
         ->  Foreign Scan on trips_2024_03 trips_6  (cost=0.00..1.00 rows=1 width=0)
               DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2024_03 LIMIT 1
         ->  Foreign Scan on trips_2024_04 trips_7  (cost=0.00..1.00 rows=1 width=0)
               DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM public.trips_2024_04 LIMIT 1
(16 rows)
kysshsy commented 2 months ago
explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;

@rebasedming . It looks like a partitioned table's query plan.I am unable to reproduce this.

pg_analytics=# CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
CREATE FOREIGN TABLE
pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1
(1 row)

pg_analytics=# \d+ trips
                                                        Foreign table "public.trips"
        Column         |            Type             | Collation | Nullable | Default | FDW options | Storage  | Stats target | Description
-----------------------+-----------------------------+-----------+----------+---------+-------------+----------+--------------+-------------
 vendorid              | integer                     |           |          |         |             | plain    |              |
 tpep_pickup_datetime  | timestamp without time zone |           |          |         |             | plain    |              |
 tpep_dropoff_datetime | timestamp without time zone |           |          |         |             | plain    |              |
 passenger_count       | bigint                      |           |          |         |             | plain    |              |
 trip_distance         | double precision            |           |          |         |             | plain    |              |
 ratecodeid            | bigint                      |           |          |         |             | plain    |              |
 store_and_fwd_flag    | character varying           |           |          |         |             | extended |              |
 pulocationid          | integer                     |           |          |         |             | plain    |              |
 dolocationid          | integer                     |           |          |         |             | plain    |              |
 payment_type          | bigint                      |           |          |         |             | plain    |              |
 fare_amount           | double precision            |           |          |         |             | plain    |              |
 extra                 | double precision            |           |          |         |             | plain    |              |
 mta_tax               | double precision            |           |          |         |             | plain    |              |
 tip_amount            | double precision            |           |          |         |             | plain    |              |
 tolls_amount          | double precision            |           |          |         |             | plain    |              |
 improvement_surcharge | double precision            |           |          |         |             | plain    |              |
 total_amount          | double precision            |           |          |         |             | plain    |              |
 congestion_surcharge  | double precision            |           |          |         |             | plain    |              |
 airport_fee           | double precision            |           |          |         |             | plain    |              |
Server: parquet_server
FDW options: (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet')

Could you show me the result of \d+ trips ? And please also provide me with the complete SQL statement to create the table. I am very interested in observing the current behavior of the EXPLAIN for partitioned tables.

I didn't consider the case of partitioned tables before, and I feel that my code shouldn't produce this result. It really is a bit strange.

However, this query plan is actually reasonable, with limits, appends, and separate queries for different partitions. (Although it may not correspond with the current executor pushdown.)

rebasedming commented 2 months ago
explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;

@rebasedming . It looks like a partitioned table's query plan.I am unable to reproduce this.

pg_analytics=# CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
CREATE FOREIGN TABLE
pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1
(1 row)

pg_analytics=# \d+ trips
                                                        Foreign table "public.trips"
        Column         |            Type             | Collation | Nullable | Default | FDW options | Storage  | Stats target | Description
-----------------------+-----------------------------+-----------+----------+---------+-------------+----------+--------------+-------------
 vendorid              | integer                     |           |          |         |             | plain    |              |
 tpep_pickup_datetime  | timestamp without time zone |           |          |         |             | plain    |              |
 tpep_dropoff_datetime | timestamp without time zone |           |          |         |             | plain    |              |
 passenger_count       | bigint                      |           |          |         |             | plain    |              |
 trip_distance         | double precision            |           |          |         |             | plain    |              |
 ratecodeid            | bigint                      |           |          |         |             | plain    |              |
 store_and_fwd_flag    | character varying           |           |          |         |             | extended |              |
 pulocationid          | integer                     |           |          |         |             | plain    |              |
 dolocationid          | integer                     |           |          |         |             | plain    |              |
 payment_type          | bigint                      |           |          |         |             | plain    |              |
 fare_amount           | double precision            |           |          |         |             | plain    |              |
 extra                 | double precision            |           |          |         |             | plain    |              |
 mta_tax               | double precision            |           |          |         |             | plain    |              |
 tip_amount            | double precision            |           |          |         |             | plain    |              |
 tolls_amount          | double precision            |           |          |         |             | plain    |              |
 improvement_surcharge | double precision            |           |          |         |             | plain    |              |
 total_amount          | double precision            |           |          |         |             | plain    |              |
 congestion_surcharge  | double precision            |           |          |         |             | plain    |              |
 airport_fee           | double precision            |           |          |         |             | plain    |              |
Server: parquet_server
FDW options: (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet')

Could you show me the result of \d+ trips ? And please also provide me with the complete SQL statement to create the table. I am very interested in observing the current behavior of the EXPLAIN for partitioned tables.

I didn't consider the case of partitioned tables before, and I feel that my code shouldn't produce this result. It really is a bit strange.

However, this query plan is actually reasonable, with limits, appends, and separate queries for different partitions. (Although it may not correspond with the current executor pushdown.)

You're right, trips was a partitioned table. I forgot I had it sitting around. Upon closer examination, I do think

explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;

@rebasedming . It looks like a partitioned table's query plan.I am unable to reproduce this.

pg_analytics=# CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
CREATE FOREIGN TABLE
pg_analytics=# explain SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 DuckDB Scan: SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1
(1 row)

pg_analytics=# \d+ trips
                                                        Foreign table "public.trips"
        Column         |            Type             | Collation | Nullable | Default | FDW options | Storage  | Stats target | Description
-----------------------+-----------------------------+-----------+----------+---------+-------------+----------+--------------+-------------
 vendorid              | integer                     |           |          |         |             | plain    |              |
 tpep_pickup_datetime  | timestamp without time zone |           |          |         |             | plain    |              |
 tpep_dropoff_datetime | timestamp without time zone |           |          |         |             | plain    |              |
 passenger_count       | bigint                      |           |          |         |             | plain    |              |
 trip_distance         | double precision            |           |          |         |             | plain    |              |
 ratecodeid            | bigint                      |           |          |         |             | plain    |              |
 store_and_fwd_flag    | character varying           |           |          |         |             | extended |              |
 pulocationid          | integer                     |           |          |         |             | plain    |              |
 dolocationid          | integer                     |           |          |         |             | plain    |              |
 payment_type          | bigint                      |           |          |         |             | plain    |              |
 fare_amount           | double precision            |           |          |         |             | plain    |              |
 extra                 | double precision            |           |          |         |             | plain    |              |
 mta_tax               | double precision            |           |          |         |             | plain    |              |
 tip_amount            | double precision            |           |          |         |             | plain    |              |
 tolls_amount          | double precision            |           |          |         |             | plain    |              |
 improvement_surcharge | double precision            |           |          |         |             | plain    |              |
 total_amount          | double precision            |           |          |         |             | plain    |              |
 congestion_surcharge  | double precision            |           |          |         |             | plain    |              |
 airport_fee           | double precision            |           |          |         |             | plain    |              |
Server: parquet_server
FDW options: (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet')

Could you show me the result of \d+ trips ? And please also provide me with the complete SQL statement to create the table. I am very interested in observing the current behavior of the EXPLAIN for partitioned tables.

I didn't consider the case of partitioned tables before, and I feel that my code shouldn't produce this result. It really is a bit strange.

However, this query plan is actually reasonable, with limits, appends, and separate queries for different partitions. (Although it may not correspond with the current executor pushdown.)

Woops, you're totally right! I had made the trips table partitioned and forgot about it.

What I did was I created a normal heap table trips with year and month columns, and made the partitions foreign tables. Something like:

CREATE TABLE public.trips (
    vendorid              integer,
    tpep_pickup_datetime  timestamp without time zone,
    tpep_dropoff_datetime timestamp without time zone,
    passenger_count       bigint,
    trip_distance         double precision,
    ratecodeid            bigint,
    store_and_fwd_flag    character varying,
    pulocationid          integer,
    dolocationid          integer,
    payment_type          bigint,
    fare_amount           double precision,
    extra                 double precision,
    mta_tax               double precision,
    tip_amount            double precision,
    tolls_amount          double precision,
    improvement_surcharge double precision,
    total_amount          double precision,
    congestion_surcharge  double precision,
    airport_fee           double precision,
    year                  integer,
    month                 integer
) PARTITION BY LIST (year);

CREATE TABLE public.trips_2023 PARTITION OF public.trips
FOR VALUES IN (2023)
PARTITION BY LIST (month);

CREATE FOREIGN TABLE public.trips_2023_january PARTITION OF public.trips_2023
FOR VALUES IN (1)
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');

The output of the EXPLAIN I posted actually looks correct...feel free to ignore this comment.