pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

Workload reports “err execute query q7 failed Error 1105: other error for mpp stream: Code: 159, e.displayText() = DB::Exception: EstablishDisaggregated execution was interrupted, maximum execution time exceeded” after recovering from a network partition on one of the wn #8815

Open Lily2025 opened 7 months ago

Lily2025 commented 7 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Deploy a cluster with two wn and two cn.

1) Run the CH workload:
   go-tpc ch run -D tpcc --host tc-tidb.ha-test-disagg-tiflash-tps-7080664-1-490 -P4000 --warehouses 2000 -T 32 --acThreads 1 --queries q7 --ignore-error '2013,1213,1105,1205,8022,8028,9004,9007,1062' --time 36000m --user root --password '' --interval '10s'
2) Inject a network partition between one of the wn and all other pods.
3) Recover the fault after 10 minutes.

2. What did you expect to see? (Required)

Queries should not report errors after the fault is recovered.

3. What did you see instead (Required)

After the network partition on one of the wn is recovered, the workload reports “err execute query q7 failed Error 1105: other error for mpp stream: Code: 159, e.displayText() = DB::Exception: EstablishDisaggregated execution was interrupted, maximum execution time exceeded”:

[2024-03-02 18:30:25] execute run failed, err execute query q7 failed Error 1105: other error for mpp stream: Code: 159, e.displayText() = DB::Exception: EstablishDisaggregated execution was interrupted, maximum execution time exceeded, wn_address=tc-tiflash-0.tc-tiflash-peer.ha-test-disagg-tiflash-tps-7080664-1-490.svc:3930 MPP<gather_id:1, query_ts:1709404165881642170, local_query_id:338, server_id:2011, start_ts:448110045647863978,task_id:11>, e.what() = DB::Exception,

4. What is your TiFlash version? (Required)

git-hash :42dba5e04ba466ab152840d811f954510f3bf0dc

Lily2025 commented 7 months ago

/type enhancement

Lily2025 commented 7 months ago

/assign JinheLin

JinheLin commented 7 months ago

This error occurs because an RPC timed out after the network partition was recovered. After recovery, some regions need to sync a large amount of data, so the wait index step takes longer than 1 minute, which leads to the timeout.

I suggest increasing the recovery time for this test case to 10 minutes.
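
As a rough illustration of what a longer recovery time could mean in a test harness, the sketch below treats errors as failures only if they occur after a grace period following fault recovery. This is an assumption about how the check might be written, not how go-tpc or the actual test framework works. The names `RECOVERY_WINDOW`, `unexpected_errors`, and the `recovered_at` timestamp are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical grace period after the fault is recovered (10 minutes, per the suggestion).
RECOVERY_WINDOW = timedelta(minutes=10)


def unexpected_errors(error_times, recovered_at, window=RECOVERY_WINDOW):
    """Return the error timestamps that fall after the recovery window.

    Errors during the partition or within `window` after `recovered_at`
    are tolerated, since regions may still be syncing data.
    """
    deadline = recovered_at + window
    return [t for t in error_times if t > deadline]


if __name__ == "__main__":
    # Hypothetical recovery time; the error time is taken from the log line in the report.
    recovered_at = datetime(2024, 3, 2, 18, 25, 0)
    errors = [datetime(2024, 3, 2, 18, 30, 25)]
    late = unexpected_errors(errors, recovered_at)
    print("unexpected errors:", late)
```

With a 1-minute window instead, the same error would be flagged as unexpected, which matches the behaviour described in this report.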