The MPP task started around 09:03.
tidb.log
[2024/01/02 09:03:25.391 +08:00] [INFO] [local_mpp_coordinator.go:217] ["Dispatch mpp task"] [timestamp=446734623143100416] [ID=14] [QueryTs=1704157405385719988] [LocalQueryId=750] [ServerID=1491] [address=tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [plan="Table(order_line)->Send(17, 18, 19, 20, )"] [mpp-version=2] [exchange-compression-mode=FAST] [GatherID=1] [resource_group=default]
tiflash.log
[2024/01/02 09:03:25.396 +08:00] [DEBUG] [MPPTask.cpp:462] ["begin to make the task MPP<gather_id:1, query_ts:1704157405385719988, local_query_id:750, server_id:1491, start_ts:446734623143100416,task_id:14> public"] [source="MPP<gather_id:1, query_ts:1704157405385719988, local_query_id:750, server_id:1491, start_ts:446734623143100416,task_id:14>"] [thread_id=1733]
However, locks on some regions cannot be resolved correctly for more than 30 minutes:
[2024/01/02 09:40:40.865 +08:00] [INFO] [FlashService.cpp:218] ["Handling coprocessor request, is_remote_read: true, start ts: 446734623143100416, region info: {5921, 35, 85}, resource_group: default, conn_id: 0, conn_alias: "] [source=FlashService] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [INFO] [FlashService.cpp:271] ["Begin process cop request after wait 0 ms, start ts: 446734623143100416, region info: {5921, 35, 85}"] [source=FlashService] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [CoprocessorHandler.cpp:126] ["Handling DAG request: time_zone_offset: 28800\nflags: 226\noutput_offsets: 0\noutput_offsets: 1\noutput_offsets: 2\nencode_type: TypeCHBlock\nsql_mode: 0\ntime_zone_name: \"Asia/Shanghai\"\ncollect_execution_summaries: false\nroot_executor {\n tp: TypeTableScan\n tbl_scan {\n table_id: 2680\n columns {\n column_id: 1\n tp: 3\n collation: -63\n columnLen: 11\n decimal: 0\n flag: 4099\n pk_handle: false\n array: false\n }\n columns {\n column_id: 2\n tp: 3\n collation: -63\n columnLen: 11\n decimal: 0\n flag: 4099\n pk_handle: false\n array: false\n }\n columns {\n column_id: 3\n tp: 3\n collation: -63\n columnLen: 11\n decimal: 0\n flag: 4099\n pk_handle: false\n array: false\n }\n desc: false\n next_read_engine: Local\n keep_order: false\n is_fast_scan: false\n max_wait_time_ms: 0\n }\n executor_id: \"TableFullScan_43\"\n}\nforce_encode_type: true\n"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [executeQuery.cpp:398] ["(from 10.200.64.131:33102, query_id: b4abfaef-b2e9-4cd4-8b68-ffcf1e0fbe5b) time_zone_offset: 28800 flags: 226 output_offsets: 0 output_offsets: 1 output_offsets: 2 encode_type: TypeCHBlock sql_mode: 0 time_zone_name: \"Asia/Shanghai\" collect_execution_summaries: false root_executor { tp: TypeTableScan tbl_scan { table_id: 2680 columns { column_id: 1 tp: 3 collation: -63 columnLen: 11 decimal: 0 flag: 4099 pk_handle: false array: false } columns { column_id: 2 tp: 3 collation: -63 columnLen: 11 decimal: 0 flag: 4099 pk_handle: false array: false } columns { column_id: 3 tp: 3 collation: -63 columnLen: 11 decimal: 0 flag: 4099 pk_handle: false array: false } desc: false next_read_engine: Local keep_order: false is_fast_scan: false max_wait_time_ms: 0 } executor_id: \"TableFullScan_43\" } force_encode_type: true"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [PhysicalPlan.cpp:84] ["tidb table scan has runtime filter size:0"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [PhysicalPlan.cpp:299] ["build unoptimized physical plan: \n<Projection, TableFullScan_43> | is_tidb_operator: false, schema: <TableFullScan_43_table_scan_0, Int32>, <TableFullScan_43_table_scan_1, Int32>, <TableFullScan_43_table_scan_2, Int32>\n <TableScan, TableFullScan_43> | is_tidb_operator: true, schema: <table_scan_0, Int32>, <table_scan_1, Int32>, <table_scan_2, Int32>\n"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [PhysicalPlan.cpp:302] ["build optimized physical plan: \n<Projection, TableFullScan_43> | is_tidb_operator: false, schema: <TableFullScan_43_table_scan_0, Int32>, <TableFullScan_43_table_scan_1, Int32>, <TableFullScan_43_table_scan_2, Int32>\n <TableScan, TableFullScan_43> | is_tidb_operator: true, schema: <table_scan_0, Int32>, <table_scan_1, Int32>, <table_scan_2, Int32>\n"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [LearnerReadWorker.cpp:317] ["[Learner Read] Batch read index, num_regions=1 num_requests=1 num_stale_read=0 num_cached_index=0 num_unavailable=0 cost=0ms"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [INFO] [LearnerReadWorker.cpp:416] ["[Learner Read] Finish wait index and resolve locks, wait_cost=0ms n_regions=1 n_unavailable=1"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
But according to the logs, TiFlash handles the Raft logs and admin commands (the region merge) correctly:
2912960:[2024/01/02 09:19:47.911 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 84}"] [thread_id=1502]
2912983:[2024/01/02 09:20:02.779 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 84}"] [thread_id=1085]
2913023:[2024/01/02 09:20:03.461 +08:00] [INFO] [RaftCommands.cpp:152] ["[region_id=318] execute prepare merge, min_index 19, target region_id=5921"] [thread_id=62]
2913027:[2024/01/02 09:20:03.462 +08:00] [INFO] [RaftCommands.cpp:244] ["[region_id=5921 applied_term=7 applied_index=5630262] execute admin command CommitMerge at [term: 7, index: 5630263]"] [thread_id=61]
2913028:[2024/01/02 09:20:03.462 +08:00] [INFO] [RaftCommands.cpp:186] ["[region_id=5921] execute commit merge, source region_id=318, commit index=19"] [thread_id=61]
2913029:[2024/01/02 09:20:03.462 +08:00] [INFO] [RaftCommands.cpp:291] ["After execute merge cmd, current region info: [region_id=5921 index=5630263 table_id=2680 ver=85 conf_ver=35 state=Normal peer=id: 5933 store_id: 26 role: Learner]"] [thread_id=61]
2913032:[2024/01/02 09:20:03.462 +08:00] [INFO] [KVStore.cpp:393] ["Start to persist [region_id=5921 index=5630263 table_id=2680 ver=85 conf_ver=35 state=Normal peer=id: 5933 store_id: 26 role: Learner], cache size: 0 bytes for `admin raft cmd `"] [thread_id=61]
2913033:[2024/01/02 09:20:03.462 +08:00] [DEBUG] [KVStore.cpp:395] ["Persist [region_id=5921] done, cache size: 0 bytes"] [thread_id=61]
2913066:[2024/01/02 09:20:17.452 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1500]
2913077:[2024/01/02 09:20:23.481 +08:00] [DEBUG] [KVStore.cpp:557] ["[region_id=5921] approx mem cache info: rows 0, bytes 0, gap 1/200"] [thread_id=61]
2913108:[2024/01/02 09:20:31.789 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1085]
2913130:[2024/01/02 09:20:45.828 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1529]
So this should be an issue where client-c/TiDB does not resolve the locks correctly.
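For reference, here is a minimal Go sketch of the behaviour the reader side is expected to have when TiFlash reports "Key is locked": resolve the lock through the locking transaction's primary and retry with backoff. All type and function names here are hypothetical stand-ins, not the actual client-go/client-c API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Lock mirrors the fields visible in the LockException / primary_lock errors above.
type Lock struct {
	Key         []byte
	Primary     []byte
	LockVersion uint64 // start_ts of the locking transaction
	TTL         uint64
	MinCommitTS uint64
	AsyncCommit bool
}

// resolveLocks is a hypothetical stand-in for the client-side lock resolver.
// Real resolution needs RPCs to TiKV (check the txn status on the primary,
// then commit or roll back the lock); if TiKV cannot apply writes, these RPCs
// stall and the lock stays.
func resolveLocks(readTS uint64, locks []Lock) error {
	return errors.New("resolve lock RPC stuck")
}

// readWithLockRetry sketches the expected loop: fetch, resolve any locks that
// block the snapshot, back off, and retry.
func readWithLockRetry(readTS uint64, maxRetries int, fetch func() ([]Lock, error)) error {
	backoff := 10 * time.Millisecond
	for i := 0; i < maxRetries; i++ {
		locks, err := fetch()
		if err != nil {
			return err
		}
		if len(locks) == 0 {
			return nil // no locks block readTS, the read succeeded
		}
		if err := resolveLocks(readTS, locks); err != nil {
			// Resolution failed; wait and retry. This is what produces the
			// repeated LockException lines in the TiFlash log above.
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return errors.New("locks still unresolved after retries")
}

func main() {
	// Simulate TiFlash always reporting the same unresolved lock.
	pending := Lock{AsyncCommit: true}
	err := readWithLockRetry(446734623143100416, 5, func() ([]Lock, error) {
		return []Lock{pending}, nil
	})
	fmt.Println(err) // locks still unresolved after retries
}
```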
There are some long-running transactions in TiDB's log (a sketch of the async-commit lock check follows the excerpts below):
> rg 'primary_lock' tidb.log | head
[2024/01/02 08:26:39.574 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734039555506176] [storeAddr=tc-tiflash-2.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=22] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\001\\254\\003\\200\\000\\000\\000\\000\\000\\000\\003\"\nlock_version: 446734038936584200\nkey: \"t\\200\\000\\000\\000\\000\\000\\nz_r\\200\\000\\000\\000\\002\\0010U\"\nlock_ttl: 20005\ntxn_size: 1\nlock_for_update_ts: 446734038936584200\nuse_async_commit: true\nmin_commit_ts: 446734038936584282\n, e.what() = DB::Exception,"]
[2024/01/02 08:26:39.584 +08:00] [INFO] [conn.go:1132] ["command dispatched failed"] [conn=3126854430] [session_alias=] [connInfo="id:3126854430, addr:10.200.72.180:55474 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="set @@tidb_read_staleness='-60'; select c_last, c_id o_id, o_entry_d, o_ol_cnt, sum(ol_amount) from customer, orders, order_line where c_id = o_c_id and c_w_id = o_w_id and c_d_id = o_d_id and ol_w_id = o_w_id and ol_d_id = o_d_id and ol_o_id = o_id group by o_id, o_w_id, o_d_id, c_id, c_last, o_entry_d, o_ol_cnt having sum(ol_amount) > 200 order by sum(ol_amount) desc, o_entry_d "] [txn_mode=PESSIMISTIC] [timestamp=0] [err="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\001\\254\\003\\200\\000\\000\\000\\000\\000\\000\\003\"\nlock_version: 446734038936584200\nkey: \"t\\200\\000\\000\\000\\000\\000\\nz_r\\200\\000\\000\\000\\002\\0010U\"\nlock_ttl: 20005\ntxn_size: 1\nlock_for_update_ts: 446734038936584200\nuse_async_commit: true\nmin_commit_ts: 446734038936584282\n, e.what() = DB::Exception,\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleMPPStreamResponse\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:631\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).receiveResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:514\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleDispatchReq\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:467\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).dispatchAll.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:372\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]
[2024/01/02 08:27:00.882 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734044911370240] [storeAddr=tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=33] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\000\\316\\003\\200\\000\\000\\000\\000\\000\\000\\005\"\nlock_version: 446734038936584240\nkey: \"t\\200\\000\\000\\000\\000\\000\\n~_r\\200\\000\\000\\000\\001$zl\"\nlock_ttl: 20010\ntxn_size: 1\nlock_for_update_ts: 446734038936584240\nuse_async_commit: true\nmin_commit_ts: 446734038936584300\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:00.887 +08:00] [INFO] [conn.go:1132] ["command dispatched failed"] [conn=3126854430] [session_alias=] [connInfo="id:3126854430, addr:10.200.72.180:55474 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="set @@tidb_read_staleness='-60'; select s_nationkey as supp_nation, substr(c_state,1,1) as cust_nation, extract(year from o_entry_d) as l_year, sum(ol_amount) as revenue from supplier, stock, order_line, orders, customer, nation n1, nation n2 where ol_supply_w_id = s_w_id and ol_i_id = s_i_id and mod((s_w_id * s_i_id), 10000) = s_suppkey and ol_w_id = o_w_id and ol_d_id = o_d_id and ol_o_id = o_id and c_id = o_c_id and c_w_id = o_w_id and c_d_id = o_d_id and s_nationkey = n1.n_nationkey and ascii(substr(c_state,1,1)) - 65 = n2.n_nationkey and ( (n1.n_name = 'JAPAN' and n2.n_name = 'CHINA') or (n1.n_name = 'CHINA' and n2.n_name = 'JAPAN') ) and ol_delivery_d between '2007-01-02 00:00:00.000000' and '2032-01-02 00:00:00.000000' group by s_nationkey, substr(c_state,1,1), extract(year from o_entry_d) order by s_nationkey, cust_nation, l_year; "] [txn_mode=PESSIMISTIC] [timestamp=0] [err="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\000\\316\\003\\200\\000\\000\\000\\000\\000\\000\\005\"\nlock_version: 446734038936584240\nkey: \"t\\200\\000\\000\\000\\000\\000\\n~_r\\200\\000\\000\\000\\001$zl\"\nlock_ttl: 20010\ntxn_size: 1\nlock_for_update_ts: 446734038936584240\nuse_async_commit: true\nmin_commit_ts: 446734038936584300\n, e.what() = DB::Exception,\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleMPPStreamResponse\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:631\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).receiveResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:514\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleDispatchReq\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:467\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).dispatchAll.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:372\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]
[2024/01/02 08:27:21.774 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734050495823872] [storeAddr=tc-tiflash-2.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=5] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:21.774 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734050495823872] [storeAddr=tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=6] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:21.774 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734050495823872] [storeAddr=tc-tiflash-3.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=8] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:21.775 +08:00] [INFO] [conn.go:1132] ["command dispatched failed"] [conn=3126854430] [session_alias=] [connInfo="id:3126854430, addr:10.200.72.180:55474 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="set @@tidb_read_staleness='-60'; select ol_number, sum(ol_quantity) as sum_qty, sum(ol_amount) as sum_amount, avg(ol_quantity) as avg_qty, avg(ol_amount) as avg_amount, count(*) as count_order from order_line where ol_delivery_d > '2007-01-02 00:00:00.000000' group by ol_number order by ol_number; "] [txn_mode=PESSIMISTIC] [timestamp=0] [err="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleMPPStreamResponse\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:631\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).receiveResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:514\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleDispatchReq\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:467\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).dispatchAll.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:372\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]
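Note that the locks above all carry use_async_commit: true, so whether they block a reader depends on min_commit_ts and on the status of every key of the locking transaction. The sketch below, which could sit alongside the earlier sketch, illustrates that decision under simplifying assumptions; it is not the actual client-go logic.

```go
// checkAsyncCommitLock sketches, under simplifying assumptions, how a reader
// can decide whether an async-commit lock blocks its snapshot. In the real
// protocol this requires checking every key of the locking transaction over
// RPC, which is exactly the step that stalls if TiKV cannot make progress.
func checkAsyncCommitLock(readTS, lockMinCommitTS uint64, allKeysStillLocked bool) (blocksReader bool) {
	if !allKeysStillLocked {
		// Some key of the transaction no longer holds a lock, so the
		// transaction has already been committed or rolled back and the
		// resolver can clean this lock up instead of waiting.
		return false
	}
	if lockMinCommitTS > readTS {
		// The transaction can only commit after the reader's snapshot
		// timestamp, so the reader may ignore this lock.
		return false
	}
	// Otherwise the transaction may still commit at a timestamp <= readTS,
	// so the reader has to wait for (or drive) lock resolution.
	return true
}
```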
> rg '446734038949691420' tidb.log
29855:[2024/01/02 09:03:19.235 +08:00] [WARN] [expensivequery.go:145] [expensive_query] [cost_time=2282.358999937s] [conn=3126854432] [user=root] [database=tpcc] [txn_start_ts=446734038949691420] [mem_max="16384 Bytes (16 KB)"] [sql=COMMIT] [session_alias=]
29873:[2024/01/02 09:03:24.734 +08:00] [INFO] [region_request.go:1532] ["throwing pseudo region error due to no replica available"] [conn=3126854432] [session_alias=] [req-ts=446734038949691420] [req-type=Prewrite] [region="{ region id: 6011, ver: 101, confVer: 21 }"] [replica-read-type=leader] [stale-read=false] [request-sender="{rpcError:wait recvLoop timeout,timeout:30s, wait_duration:1.187µs:: context deadline exceeded,replicaSelector: replicaSelector{selectorStateStr: accessKnownLeader, cacheRegionIsValid: false, replicaStatus: [peer: 6012, store: 1, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6013, store: 14, isEpochStale: false, attempts: 10, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6014, store: 5, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [retry-times=14] [total-backoff-ms=597704] [total-backoff-times=90] [max-exec-timeout-ms=20000] [total-region-errors="6013-deadline_exceeded:12, 6014-not_leader-6013:1, 6012-not_leader-6013:1"]
29874:[2024/01/02 09:03:24.863 +08:00] [WARN] [prewrite.go:291] ["slow prewrite request"] [startTS=446734038949691420] [region="{ region id: 6011, ver: 101, confVer: 21 }"] [attempts=8]
29922:[2024/01/02 09:03:26.868 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2310000]
29980:[2024/01/02 09:03:36.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2320000]
29994:[2024/01/02 09:03:46.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2330000]
30025:[2024/01/02 09:03:56.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2340000]
30034:[2024/01/02 09:04:06.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2350000]
30039:[2024/01/02 09:04:10.911 +08:00] [INFO] [region_request.go:1532] ["send request on region error failed, err: region unavailable"] [conn=3126854432] [session_alias=] [req-ts=446734038949691420] [req-type=Prewrite] [region="{ region id: 6011, ver: 101, confVer: 21 }"] [replica-read-type=leader] [stale-read=false] [request-sender="{rpcError:wait recvLoop timeout,timeout:30s, wait_duration:1.187µs:: context deadline exceeded,replicaSelector: replicaSelector{selectorStateStr: accessKnownLeader, cacheRegionIsValid: true, replicaStatus: [peer: 6012, store: 1, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6013, store: 14, isEpochStale: false, attempts: 2, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6014, store: 5, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [retry-times=1] [total-backoff-ms=603876] [total-backoff-times=92] [max-exec-timeout-ms=20000] [total-region-errors=6013-deadline_exceeded:2]
30040:[2024/01/02 09:04:10.911 +08:00] [ERROR] [2pc.go:1544] ["Async commit/1PC result undetermined"] [conn=3126854432] [session_alias=] [error="region unavailable"] [rpcErr="wait recvLoop timeout,timeout:30s, wait_duration:1.187µs:: context deadline exceeded"] [txnStartTS=446734038949691420]
The root cause is that an apply thread on one TiKV instance got stuck, so the resolve-lock RPC got stuck as well. TiKV has fixed this issue: https://github.com/tikv/tikv/issues/16340
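As a purely illustrative toy model (not TiKV code) of that failure chain: a single apply worker drains tasks in FIFO order, so one task that never finishes blocks everything queued behind it, including the write a ResolveLock request needs applied, and readers keep seeing the lock.

```go
package main

import (
	"fmt"
	"time"
)

type applyTask struct {
	name string
	run  func()
}

// applyWorker models an apply thread: tasks are applied strictly in order,
// so a task that never returns blocks the whole queue behind it.
func applyWorker(tasks <-chan applyTask) {
	for t := range tasks {
		t.run()
		fmt.Println("applied:", t.name)
	}
}

func main() {
	tasks := make(chan applyTask, 8)
	go applyWorker(tasks)

	// The first task hangs forever, like the stuck apply thread.
	tasks <- applyTask{name: "stuck write", run: func() { select {} }}
	// The write behind a ResolveLock request is queued but never applied,
	// so the lock is never cleaned up and readers keep hitting it.
	tasks <- applyTask{name: "resolve lock", run: func() {}}

	time.Sleep(time.Second)
	fmt.Println("resolve lock still pending; readers keep seeing LockException")
}
```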
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Run the stale read test: TPC-C queries executed with set @@tidb_read_staleness='-60' against TiFlash replicas (see the SQL in the logs above).
2. What did you expect to see? (Required)
The stale read MPP queries resolve the blocking locks and finish normally.
3. What did you see instead? (Required)
The queries kept failing with "LockException: Key is locked" on region 5921 for more than 30 minutes because the locks were never resolved.
4. What is your TiFlash version? (Required)
master