Open Lily2025 opened 7 months ago
/type enhancement
/assign JinheLin
This error is an RPC timeout that occurs after the network partition is recovered. Once the partition heals, some regions need to sync a large amount of data, so the wait-index step takes longer than 1 minute and the request times out.
Suggest increasing the recovery time for this case to 10 minutes.
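To illustrate the suggestion, here is a minimal sketch of how a test case could allow a recovery window before treating query errors as failures. It is not taken from the actual test framework: `wait_for_recovery`, `probe`, and the parameter names are hypothetical, and `probe` stands in for whatever check the case uses (for example, running q7 once and returning whether it succeeded).

```python
import time
from typing import Callable


def wait_for_recovery(
    probe: Callable[[], bool],
    recovery_window_s: float = 600.0,  # 10 minutes, as suggested above
    poll_interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or the recovery window elapses.

    Returns True if the probe succeeded within the window, False otherwise.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + recovery_window_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

With a 1-minute window, a cluster that needs about 90 seconds to catch up would be reported as failed; with the 10-minute window the same run would pass.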
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Deploy a cluster with two write nodes (wn) and two compute nodes (cn).
1) Run the CH workload:
   go-tpc ch run -D tpcc --host tc-tidb.ha-test-disagg-tiflash-tps-7080664-1-490 -P4000 --warehouses 2000 -T 32 --acThreads 1 --queries q7 --ignore-error '2013,1213,1105,1205,8022,8028,9004,9007,1062' --time 36000m --user root --password '' --interval '10s'
2) Inject a network partition between one of the wn pods and all other pods.
3) Recover the fault after 10 minutes.
2. What did you expect to see? (Required)
Queries should not report errors after the fault is recovered.
3. What did you see instead (Required)
The workload reports "err execute query q7 failed Error 1105: other error for mpp stream: Code: 159, e.displayText() = DB::Exception: EstablishDisaggregated execution was interrupted, maximum execution time exceeded" after the network partition on one of the wn pods is recovered:
[2024-03-02 18:30:25] execute run failed, err execute query q7 failed Error 1105: other error for mpp stream: Code: 159, e.displayText() = DB::Exception: EstablishDisaggregated execution was interrupted, maximum execution time exceeded, wn_address=tc-tiflash-0.tc-tiflash-peer.ha-test-disagg-tiflash-tps-7080664-1-490.svc:3930 MPP<gather_id:1, query_ts:1709404165881642170, local_query_id:338, server_id:2011, start_ts:448110045647863978,task_id:11>, e.what() = DB::Exception,
4. What is your TiFlash version? (Required)
git-hash :42dba5e04ba466ab152840d811f954510f3bf0dc