pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

task execution time should have an upper limit in pipeline model #9433

Open windtalker opened 5 days ago

windtalker commented 5 days ago

Bug Report


1. Minimal reproduce step (Required)

In the pipeline model, a scheduled task is executed in a fixed-size thread pool, so the per-round execution time of each task must have an upper limit; otherwise a single task could block all other queries. Normally the per-round execution time is predictable and short, because by design a task processes only one block of data (normally fewer than 60k rows) in each round of execution, so we can assume it already has a reasonable upper limit. In some special cases, however, an operator may take a very long time to process one block of data, and we need to handle this explicitly. Special attention should be paid to the following kinds of operators:

This issue aims to identify and improve operators that may seriously affect the normal operation of the system. For other operators, since their timing is predictable, they are not the focus of this issue.
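To make the per-round limit concrete, here is a minimal sketch of a task that checks a time budget between small units of work and yields back to the scheduler once the budget is spent. All names (`BudgetedTask`, `executeOneRound`, `runToCompletion`) are hypothetical and are not TiFlash's actual pipeline API; the clock is injected only so the behaviour can be tested deterministically, and a single-threaded round-robin loop stands in for the fixed-size thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Nanos = std::chrono::nanoseconds;
using NowFn = std::function<Nanos()>; // injectable clock (assumption, for testability)

enum class RoundStatus { Finished, Yielded };

// Hypothetical task: "work" is just counting rows down in fixed-size steps.
class BudgetedTask
{
public:
    BudgetedTask(std::size_t total_rows, std::size_t rows_per_step, NowFn now_)
        : remaining_rows(total_rows), step(rows_per_step), now(std::move(now_))
    {
        assert(step > 0);
    }

    // Run one round: work in small steps until everything is done or the
    // per-round budget is used up, then hand the thread back.
    RoundStatus executeOneRound(Nanos budget)
    {
        const Nanos deadline = now() + budget;
        while (remaining_rows > 0)
        {
            remaining_rows -= std::min(step, remaining_rows); // stand-in for real work
            if (remaining_rows > 0 && now() >= deadline)
                return RoundStatus::Yielded;
        }
        return RoundStatus::Finished;
    }

    std::size_t remainingRows() const { return remaining_rows; }

private:
    std::size_t remaining_rows;
    std::size_t step;
    NowFn now;
};

// Round-robin over a queue of tasks; a task that yields is re-queued behind
// the others. Returns the total number of rounds executed.
std::size_t runToCompletion(std::deque<BudgetedTask> & queue, Nanos budget)
{
    std::size_t rounds = 0;
    while (!queue.empty())
    {
        BudgetedTask task = std::move(queue.front());
        queue.pop_front();
        ++rounds;
        if (task.executeOneRound(budget) == RoundStatus::Yielded)
            queue.push_back(std::move(task));
    }
    return rounds;
}
```

The budget is only checked between steps, so this bounds a round only if each step is itself short. An operator whose single step is unbounded still needs to be made interruptible internally.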

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

SeaRise commented 1 day ago

In Velox, the operator gets task->execute_time to determine whether it has timed out. Perhaps we can put the Task pointer in thread_local to get the execution time.
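A rough sketch of the thread_local idea, under stated assumptions: `TaskContext`, `CurrentTaskGuard`, and `roundBudgetExceeded` are hypothetical names, not existing TiFlash or Velox APIs. The worker thread installs a pointer to the running task's context before each round, and any operator deep in the call stack can ask whether the round has exceeded its budget without the task being passed down explicitly.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical per-round bookkeeping owned by the task.
struct TaskContext
{
    Clock::time_point round_start;
    std::chrono::nanoseconds round_budget;
};

// One slot per worker thread; null when no task is running on this thread.
thread_local const TaskContext * current_task = nullptr;

// RAII guard: the scheduler creates it around one round of execution.
class CurrentTaskGuard
{
public:
    explicit CurrentTaskGuard(const TaskContext & ctx)
    {
        assert(current_task == nullptr); // one task per worker thread at a time
        current_task = &ctx;
    }
    ~CurrentTaskGuard() { current_task = nullptr; }

    CurrentTaskGuard(const CurrentTaskGuard &) = delete;
    CurrentTaskGuard & operator=(const CurrentTaskGuard &) = delete;
};

// Called from inside an operator. Returns false when no task is installed,
// so code running outside the pipeline is never interrupted.
bool roundBudgetExceeded()
{
    if (current_task == nullptr)
        return false;
    return Clock::now() - current_task->round_start >= current_task->round_budget;
}
```

The check only reports that the budget is exceeded. What the operator does next (save its state and return early, rather than fail) is a separate design decision.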

windtalker commented 1 day ago

> In Velox, the operator gets task->execute_time to determine whether it has timed out. Perhaps we can put the Task pointer in thread_local to get the execution time.

But we cannot simply return an error if the execution time exceeds the limit?

SeaRise commented 1 day ago

> > In Velox, the operator gets task->execute_time to determine whether it has timed out. Perhaps we can put the Task pointer in thread_local to get the execution time.
>
> But we cannot simply return an error if the execution time exceeds the limit?

Ah, yes, the need for a join probe that can return after probing a portion of the data (more thorough than before
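
One way to picture a join probe that returns after probing only part of the data is a probe object that remembers its position between calls. The sketch below is an illustration under assumptions, not TiFlash's join implementation: `ResumableProbe`, `probeSome`, and the `std::unordered_multimap` build table are all hypothetical stand-ins. It caps the number of matches emitted per call and saves both the probe row and the position inside that row's match range, so a key with many matches can be split across rounds.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using Key = int;
using BuildTable = std::unordered_multimap<Key, int>; // key -> build-side row id
using Match = std::pair<std::size_t, int>;            // (probe row, build row id)

class ResumableProbe
{
public:
    // The build table must outlive the probe; only a reference is stored.
    ResumableProbe(const BuildTable & table_, std::vector<Key> probe_keys_)
        : table(table_), probe_keys(std::move(probe_keys_))
    {}

    // Emit at most max_matches matches into out, then return.
    // Returns true once every probe row has been fully processed.
    bool probeSome(std::size_t max_matches, std::vector<Match> & out)
    {
        assert(max_matches > 0);
        std::size_t emitted = 0;
        while (row < probe_keys.size())
        {
            if (!in_range)
            {
                auto range = table.equal_range(probe_keys[row]);
                it = range.first;
                range_end = range.second;
                in_range = true;
            }
            for (; it != range_end; ++it)
            {
                if (emitted == max_matches)
                    return false; // budget used up; resume from `it` next call
                out.emplace_back(row, it->second);
                ++emitted;
            }
            in_range = false;
            ++row;
        }
        return true;
    }

private:
    const BuildTable & table;
    std::vector<Key> probe_keys;
    std::size_t row = 0;    // next probe row to process
    bool in_range = false;  // true while part-way through one row's matches
    BuildTable::const_iterator it;
    BuildTable::const_iterator range_end;
};
```

A match count is used as the budget here only because it is deterministic. The same saved-cursor structure would work with a time check in place of `emitted == max_matches`.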