risingwavelabs / risingwave

Bug: A batch query with sort agg gets stuck. #16490

Open chenzl25 opened 4 months ago

chenzl25 commented 4 months ago

Describe the bug

One of our users encountered an SQL query (with 270 million rows in the ods.ods_mk_mcos table) that kept running without returning any results. While monitoring the batch query, we noticed that the RowSeqScan stopped fetching data after a while. Additionally, even after the query was killed, a residual MPP task remained on one CN node, which kept the version pinned and unreleased. The query consistently reproduces the issue in one of our users' environments. The execution plan contains SortAgg and MergeSortExchange, and disabling both allowed the query to execute normally, so it seems like there might be a deadlock somewhere. FYI, MergeSortExchange differs from a regular Exchange in that it must fetch data from every parallel input before it can return any rows.

SQL:

SELECT count(1) AS cnt, company_id FROM ods.ods_mk_mcos_files GROUP BY company_id ORDER BY cnt DESC LIMIT 100;

Plan:

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 BatchTopN { order: [count(1:Int32) DESC], limit: 100, offset: 0 }
 └─BatchExchange { order: [], dist: Single }
   └─BatchTopN { order: [count(1:Int32) DESC], limit: 100, offset: 0 }
     └─BatchProject { exprs: [count(1:Int32), ods_mk_mcos_files.company_id] }
       └─BatchSortAgg { group_key: [ods_mk_mcos_files.company_id], aggs: [count(1:Int32)] }
         └─BatchExchange { order: [ods_mk_mcos_files.company_id ASC], dist: HashShard(ods_mk_mcos_files.company_id) }
           └─BatchProject { exprs: [ods_mk_mcos_files.company_id, 1:Int32] }
             └─BatchScan { table: ods_mk_mcos_files, columns: [company_id] }
(8 rows)
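
To make the last point of the description concrete, here is a minimal Python sketch of a k-way merge over sorted inputs. It is not RisingWave's implementation; `merge_sorted_inputs` and the `pulled` log are hypothetical names used only for illustration. It shows why an order-preserving exchange cannot emit its first row until it has received one row from every upstream input, so a single stalled input can presumably block the whole operator.

```python
import heapq

_EXHAUSTED = object()  # sentinel marking an input that has no more rows


def merge_sorted_inputs(inputs, pulled):
    """Merge already-sorted inputs into one sorted stream.

    `pulled` is a list that records, in order, which input index was polled.
    Hypothetical helper for illustration only.
    """
    iters = [iter(rows) for rows in inputs]
    heap = []

    # Priming phase: the smallest row overall is unknown until one row
    # has been read from EVERY input. A stalled input blocks here.
    for idx, it in enumerate(iters):
        row = next(it, _EXHAUSTED)
        pulled.append(idx)
        if row is not _EXHAUSTED:
            heap.append((row, idx))
    heapq.heapify(heap)

    # Steady state: emit the smallest buffered row, then refill from the
    # same input it came from before emitting anything else.
    while heap:
        row, idx = heapq.heappop(heap)
        yield row
        nxt = next(iters[idx], _EXHAUSTED)
        pulled.append(idx)
        if nxt is not _EXHAUSTED:
            heapq.heappush(heap, (nxt, idx))


if __name__ == "__main__":
    pulled = []
    merged = merge_sorted_inputs([[1, 4, 7], [2, 5, 8], [3, 6, 9]], pulled)
    print(next(merged))  # 1
    print(pulled)        # [0, 1, 2]: all three inputs were polled first
```

A regular (unordered) exchange could forward rows from whichever input is ready, which may be why the query completes once SortAgg and MergeSortExchange are disabled.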

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

fuyufjh commented 4 months ago

Await tree for batch? 🤔