PostgreSQL 如何估算 HashAggregate 的 Return Rows ，以及低估的後果

相信 Query Planner 前，先看看 selfuncs.h

7e8a47dfc4d554b6f15d329bcbb69029_XL

序

PostgreSQL 資料庫是怎麼決定如何執行 SQL 的呢？

其中一個很重要的步驟就是預估 Return Rows 的數量，它被用來：

給底層的 Scan Node 比較各種 Access Method (Seq Scan, Index Scan, Bitmap Index Scan + Bitmap Heap Scan, …) 的 Cost
往往更重要的是 Return Rows 的估算對於上層其他 Plan Node 的 Cost 有巨大影響，例如 Nested Loop 或 Hash Join，因為預估它們就是會從底下的 Scan Node 拿出這麼多的 rows 數量來處理

最近我們同事在開發時就遇到 HashAggregate Node 的 Retrun Rows 數量被嚴重低估，導致資料庫最終選到一個費時比預期更長的執行路徑。

`HashAggregate` Node 是做什麼的？

其實我們常常可以在 PostgreSQL 的 EXPLAIN 輸出中看到 HashAggregate Node 的蹤影，它是我們在使用 DISTINCT, GROUP BY, WHERE IN 等 SQL 操作時負責幫我們做聚合操作的 Node。

其實作在 nodeAgg.c: https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/executor/nodeAgg.c#L3-L11

案例

我們的案例是想要找出所有訂閱某個 playlist 的使用者裝置，步驟是：

先從 playlist_subscriptions 表抓出訂閱者放進 CTE
再從 devices 表去 WHERE IN 剛剛的 CTE，找出這些訂閱者的裝置

實際的 Query Plan：

postgres=# 
EXPLAIN WITH subscribers AS MATERIALIZED (SELECT user_id FROM playlist_subscriptions WHERE list_id = 3343594)
SELECT * FROM devices WHERE user_id IN (SELECT user_id FROM subscribers);
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=33227.59..58222.98 rows=294587 width=170)
   CTE subscribers
     ->  Bitmap Heap Scan on playlist_subscriptions  (cost=1350.13..32705.43 rows=23188 width=8)
           Recheck Cond: (list_id = 3343594)
           ->  Bitmap Index Scan on playlist_subscriptions_list_id_user_id_uniq  (cost=0.00..1344.34 rows=23188 width=0)
                 Index Cond: (list_id = 3343594)
   ->  HashAggregate  (cost=521.73..523.73 rows=200 width=8)
         Group Key: subscribers.user_id
         ->  CTE Scan on subscribers  (cost=0.00..463.76 rows=23188 width=8)
   ->  Index Scan using devices_user_id_idx on devices  (cost=0.43..124.84 rows=13 width=170)
         Index Cond: (user_id = subscribers.user_id)
(11 rows)

從上面的 EXPLAIN 可以注意到，原本預估會從 playlist_subscriptions 這張表取出 23188 個 rows，但是經過 HashAggregate 這層之後，變成預估只取出 200 rows。

這最終導致低估了 Nested Loop * Index Scan 的 Cost (58222.98)。而在我們的資料庫搭配的又是較少的 CPU 跟 RAM 還有 HDD 的情況下，較大量的 Random Page Access 會相當緩慢。

這也促使我去理解資料庫到底是怎麼估算 HashAggregate 的 Retrun Rows？

這個 200 到底又是怎麼估算出來的呢？

讀原始碼

我們可以從 nodeAgg.c 中找到 HashAggregate，並從註解中可以知道他仰賴於 numGroups 的估算。 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/executor/nodeAgg.c#L172-L178

幾乎所有估算 selectivity 的 function 都可以在 selfuncs.c 中找到，而我們要找的 numGroups 的估算就是其中的 estimate_num_groups。 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L2978-L3045

estimate_num_groups 的註解也詳細說明了他的估算流程，這裡寫個大致的流程：

蒐集 GROUP BY 中每一個 column 或 expr 的 VariableStatData，包含欄位在 pg_statistic 中的 ndistinct, 還有 pg_class 中的 reltuples 數量，並儲存於 varinfos
檢查 varinfos 是否有可用的 multivariate n-distinct 統計可以使用，否則 GROUP BY 中每個 column 或 expr 都被當作獨立變數
估算 varinfos 中每個獨立變數在套上 restriction selectivity 之後的 ndistinct 並將他們相乘相乘得出最終 ndistinct

流程中值得一提的部分是 multivariate n-distinct 統計，這是 PostgreSQL 10 才開始有的功能：

我們可以針對關聯性高的複數欄位預先創建 multivariate n-distinct 統計資訊，否則它們在 GROUP BY 中將被當成個別的獨立變數，最終會造成 estimate_num_groups 高估。 Chapter 70.2 Multivariate Statistics Examples

套上 restriction selectivity 來算 ndistinct 的算法則是相對單純，在註解中也寫了概念與出處：

若一張表中總共有 N rows，其中總共有 n distinct value 並假設他們是 uniform distribution。那若從表中隨機取出 p rows，其中包含的 distinct value 可以用 n * (1 - ((N-p)/N)^(N/n)) 近似。

https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L3317-L3335

不過我們知道對我們來說最重要的是第一步準備每一個 VariableStatData 的部分，因為它的 ndistinct 會直接決定估算的結果。

每一個 VariableStatData 的 ndistinct 的值則是在 estimate_num_groups 中呼叫 add_unique_group_var 中的 get_variable_numdistinct 取得，流程如下：

嘗試從 pg_statistic 中取得既有的統計資料，若沒統計資料則設法猜猜看，例如透過型別、attribute number 等 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L3118-L3126 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L3152-L3162 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L2930-L2940 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L4966-L4989
檢查欄位是否 unique，如果是則放棄 pg_statistic 的 ndistinct 而使用 pg_class 中的 reltuples 取代 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L5040-L5049
最終若還是沒法知道合理的 ndistinct 就用 DEFAULT_NUM_DISTINCT https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L5056-L5063 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L5077-L5087

謎底揭曉： DEFAULT_NUM_DISTINCT 就是 200。 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/include/utils/selfuncs.h#L44-L47

驗證

為了驗證，我們直接修改 DEFAULT_NUM_DISTINCT 改成這個案例中接近實際的 20000，然後重新編譯 PostgreSQL 再重新跑一次 EXPLAIN：

postgres=# 
EXPLAIN WITH subscribers AS MATERIALIZED (SELECT user_id FROM playlist_subscriptions WHERE list_id = 3343594)
SELECT * FROM devices WHERE user_id IN (SELECT user_id FROM subscribers);
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=33677.16..245826.56 rows=294587 width=170)
   Hash Cond: (devices.user_id = subscribers.user_id)
   CTE subscribers
     ->  Bitmap Heap Scan on playlist_subscriptions  (cost=1350.13..32705.43 rows=23188 width=8)
           Recheck Cond: (list_id = 3343594)
           ->  Bitmap Index Scan on playlist_subscriptions_list_id_user_id_uniq  (cost=0.00..1344.34 rows=23188 width=0)
                 Index Cond: (list_id = 3343594)
   ->  Seq Scan on devices  (cost=0.00..194390.62 rows=5516762 width=170)
   ->  Hash  (cost=721.73..721.73 rows=20000 width=8)
         ->  HashAggregate  (cost=521.73..721.73 rows=20000 width=8)
               Group Key: subscribers.user_id
               ->  CTE Scan on subscribers  (cost=0.00..463.76 rows=23188 width=8)
(12 rows)

確實可以看到 HashAggregate Node 顯示的 return rows 已經改變成 20000，整體的 Plan 也改成 Seq Scan on devices + Hash Join subscribers，總 cost 估計是 245826.56。確實用這個 Plan 在我們的硬體配置上跑的會快很多。

那我們再來比較一下若是強制使用原先的 Nested Loop * Index Scan on devices 的總 cost 估計會是多少？

這邊我們先強制關閉了 hashjoin：

postgres=# SET enable_hashjoin=off;
SET
postgres=# 
EXPLAIN WITH subscribers AS MATERIALIZED (SELECT user_id FROM playlist_subscriptions WHERE list_id = 3343594)
SELECT * FROM devices WHERE user_id IN (SELECT user_id FROM subscribers);
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=33227.59..1384775.03 rows=294587 width=170)
   CTE subscribers
     ->  Bitmap Heap Scan on playlist_subscriptions  (cost=1350.13..32705.43 rows=23188 width=8)
           Recheck Cond: (list_id = 3343594)
           ->  Bitmap Index Scan on playlist_subscriptions_list_id_user_id_uniq  (cost=0.00..1344.34 rows=23188 width=0)
                 Index Cond: (list_id = 3343594)
   ->  HashAggregate  (cost=521.73..721.73 rows=20000 width=8)
         Group Key: subscribers.user_id
         ->  CTE Scan on subscribers  (cost=0.00..463.76 rows=23188 width=8)
   ->  Index Scan using devices_user_id_idx on devices  (cost=0.43..67.44 rows=13 width=170)
         Index Cond: (user_id = subscribers.user_id)
(11 rows)

是 1384775.03，確實原先的 58222.98 是嚴重低估的。

Cost 拆解

我們可以大致來拆解一下這兩個數字是怎麼來的，造成差異最主要就是 HashAggregate 預估的 return rows 差了一百倍：

HashAggregate 回傳 200 的時候：58222.98 ~= 33227.59 + 200*124.84
後來我們改成回傳 20000 的時候：1384775.03 ~= 33227.59 + 20000*67.44

其中 33227.59 是 Nested Loop 的 Startup Cost，而 124.84 與 67.44 則分別是兩次 EXPLAIN 的 Index Scan Total Cost

至於為什麼這邊 Index Scan 的 Cost 在 20000 時反而是 67.44 ，比僅有 200 時的 124.84 還要小？實際上是受於 costsize.c 中 max_IO_cost 的計算影響： https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/optimizer/path/costsize.c#L617

spc_random_page_cost 是設定中的random_page_cost 參數，我們用 10
loop_count 就是 HashAggregate 預估的 200 或我們改的 20000
pages_fetched 則是透過 Mackert and Lohman 於 1989 年提出的方法估算 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/optimizer/path/costsize.c#L599-L610 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/optimizer/path/costsize.c#L786-L827

這裡我們在 HashAggregate 預估 200 與 20000 時取得的 pages_fetched 分別是 2576 與 134454，對應的 max_IO_cost：

max_IO_cost = 2576 * 10 / 200 = 128.8
max_IO_cost = 134454 * 10 / 20000 = 67.227

這兩個 Cost 數字離 EXPLAIN 看到的還有一點差距是因為最終的 Costs 還會再套上 index correlation 的修正以及處理 tuple 的 cpu costs。

https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/optimizer/path/costsize.c#L708-L715

https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/optimizer/path/costsize.c#L716-L748

關於 index correlation，是指 index pages 的排序與實際插入 heap table 的順序關聯性，可以透過在 pg_stats 查詢。這裡也有篇文章有詳細介紹：cybertec-postgresql column-correlation-explained

解決方案

現在我們知道 HashAggregate 在沒有任何資訊的情況下，會直接預估出 DEFAULT_NUM_DISTINCT。這可能會嚴重影響整個執行計畫。

當然在生產環境上我們不太可能去改變 DEFAULT_NUM_DISTINCT 的值或是直接把 Index Scan 或 Nested Loop 關閉，在這次的例子中，我們只能選擇改寫 Query 繞過這個 HashAggregate。

回到這次的例子：我們是使用了 CTE 來暫存訂閱者的結果，所以才會缺乏任何統計資訊。但其實我們可以改寫成不要暫存或是不要用 CTE：

不要暫存 CTE ，這是 PostgreSQL 12 才支援的

postgres=# 
EXPLAIN WITH subscribers AS (SELECT user_id FROM playlist_subscriptions WHERE list_id = 3343594)
SELECT * FROM devices WHERE user_id IN (SELECT user_id FROM subscribers);
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=32995.28..241867.55 rows=294585 width=170)
Hash Cond: (devices.user_id = playlist_subscriptions.user_id)
->  Seq Scan on devices  (cost=0.00..194390.62 rows=5516762 width=170)
->  Hash  (cost=32705.43..32705.43 rows=23188 width=8)
     ->  Bitmap Heap Scan on playlist_subscriptions  (cost=1350.13..32705.43 rows=23188 width=8)
           Recheck Cond: (list_id = 3343594)
           ->  Bitmap Index Scan on playlist_subscriptions_list_id_user_id_uniq  (cost=0.00..1344.34 rows=23188 width=0)
                 Index Cond: (list_id = 3343594)
(8 rows)

直接寫入 WHERE IN

postgres=# 
EXPLAIN SELECT * FROM devices WHERE user_id IN (SELECT user_id FROM playlist_subscriptions WHERE list_id = 3343594);
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=32995.28..241867.55 rows=294585 width=170)
Hash Cond: (devices.user_id = playlist_subscriptions.user_id)
->  Seq Scan on devices  (cost=0.00..194390.62 rows=5516762 width=170)
->  Hash  (cost=32705.43..32705.43 rows=23188 width=8)
     ->  Bitmap Heap Scan on playlist_subscriptions  (cost=1350.13..32705.43 rows=23188 width=8)
           Recheck Cond: (list_id = 3343594)
           ->  Bitmap Index Scan on playlist_subscriptions_list_id_user_id_uniq  (cost=0.00..1344.34 rows=23188 width=0)
                 Index Cond: (list_id = 3343594)
(8 rows)

改用 JOIN

postgres=# 
EXPLAIN SELECT devices.* FROM devices JOIN playlist_subscriptions ON devices.user_id = playlist_subscriptions.user_id WHERE list_id = 3343594;
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=32995.28..241867.55 rows=294585 width=170)
Hash Cond: (devices.user_id = playlist_subscriptions.user_id)
->  Seq Scan on devices  (cost=0.00..194390.62 rows=5516762 width=170)
->  Hash  (cost=32705.43..32705.43 rows=23188 width=8)
     ->  Bitmap Heap Scan on playlist_subscriptions  (cost=1350.13..32705.43 rows=23188 width=8)
           Recheck Cond: (list_id = 3343594)
           ->  Bitmap Index Scan on playlist_subscriptions_list_id_user_id_uniq  (cost=0.00..1344.34 rows=23188 width=0)
                 Index Cond: (list_id = 3343594)
(8 rows)

以上三種寫法都會產生一樣的 Query Plan。

特別注意 PostgreSQL 12 已經可以把 CTE 與下方 Query 合併而不用暫存，但還是有相當嚴格的限制：

non-recursive
side-effect-free
called just once

若不符合這些條件，依然還是會與之前的版本一樣暫存起來。

總結

這次的案例源自於 HashAggregate 缺乏任何資訊去估算合理的 Return Rows 數量，所以直接回傳 DEFAULT_NUM_DISTINCT=200。

也就是說使用者在寫查詢時若有去查詢暫存的 CTE 或是其他缺乏統計的表或欄位時，自己心裡還是得有個底大概會查出多少 Rows。

若有用到 GROUP BY, DISTINCT 且結果數量可能與 DEFAULT_NUM_DISTINCT 相去太遠時又拿去做 WHERE IN 或 JOIN，則可能會讓 Planner 挑選到不好的執行路徑，最好還是先 EXPLAIN 看一下是否有需要改寫 Query。

另外在 PostgreSQL 10 之後，若是 WHERE, GROUP BY, DISTINCT 中有用到高度相關的欄位導致 Return Rows 被嚴重低估或高估，則可以預先創建 multivariate n-distinct 統計讓預估更準確：

CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;

SQL Commands - CREATE STATISTICS

連結

關於 Return Row Estimation 的說明除了 PostgreSQL 手冊中的第 70 章之外 Chapter 70. How the Planner Uses Statistics

幾乎所有相關的原始碼都可以在 selfuncs.h 以及 selfuncs.c 中找到

https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/include/utils/selfuncs.h#L23-L51 https://github.com/rueian/postgres/blob/16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b/src/backend/utils/adt/selfuncs.c#L1-L22 至少可以看看 selfuncs.h 裡面除了 DEFAULT_NUM_DISTINCT 還有哪些其他神奇的預設值。

rueian / postgres