pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0
948 stars 410 forks source link

Read snapshot can't be released for long time #8630

Closed lilinghai closed 10 months ago

lilinghai commented 10 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

stale read test image

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

master

JaySon-Huang commented 10 months ago

The MPP task start around 09:03

tidb.log

[2024/01/02 09:03:25.391 +08:00] [INFO] [local_mpp_coordinator.go:217] ["Dispatch mpp task"] [timestamp=446734623143100416] [ID=14] [QueryTs=1704157405385719988] [LocalQueryId=750] [ServerID=1491] [address=tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [plan="Table(order_line)->Send(17, 18, 19, 20, )"] [mpp-version=2] [exchange-compression-mode=FAST] [GatherID=1] [resource_group=default]

tiflash.log

[2024/01/02 09:03:25.396 +08:00] [DEBUG] [MPPTask.cpp:462] ["begin to make the task MPP<gather_id:1, query_ts:1704157405385719988, local_query_id:750, server_id:1491, start_ts:446734623143100416,task_id:14> public"] [source="MPP<gather_id:1, query_ts:1704157405385719988, local_query_id:750, server_id:1491, start_ts:446734623143100416,task_id:14>"] [thread_id=1733]

However, there are some regions that can not resolve locks correctly for more than 30 minutes

[2024/01/02 09:40:40.865 +08:00] [INFO] [FlashService.cpp:218] ["Handling coprocessor request, is_remote_read: true, start ts: 446734623143100416, region info: {5921, 35, 85}, resource_group: default, conn_id: 0, conn_alias: "] [source=FlashService] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [INFO] [FlashService.cpp:271] ["Begin process cop request after wait 0 ms, start ts: 446734623143100416, region info: {5921, 35, 85}"] [source=FlashService] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [CoprocessorHandler.cpp:126] ["Handling DAG request: time_zone_offset: 28800\nflags: 226\noutput_offsets: 0\noutput_offsets: 1\noutput_offsets: 2\nencode_type: TypeCHBlock\nsql_mode: 0\ntime_zone_name: \"Asia/Shanghai\"\ncollect_execution_summaries: false\nroot_executor {\n  tp: TypeTableScan\n  tbl_scan {\n    table_id: 2680\n    columns {\n      column_id: 1\n      tp: 3\n      collation: -63\n      columnLen: 11\n      decimal: 0\n      flag: 4099\n      pk_handle: false\n      array: false\n    }\n    columns {\n      column_id: 2\n      tp: 3\n      collation: -63\n      columnLen: 11\n      decimal: 0\n      flag: 4099\n      pk_handle: false\n      array: false\n    }\n    columns {\n      column_id: 3\n      tp: 3\n      collation: -63\n      columnLen: 11\n      decimal: 0\n      flag: 4099\n      pk_handle: false\n      array: false\n    }\n    desc: false\n    next_read_engine: Local\n    keep_order: false\n    is_fast_scan: false\n    max_wait_time_ms: 0\n  }\n  executor_id: \"TableFullScan_43\"\n}\nforce_encode_type: true\n"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [executeQuery.cpp:398] ["(from 10.200.64.131:33102, query_id: b4abfaef-b2e9-4cd4-8b68-ffcf1e0fbe5b) time_zone_offset: 28800 flags: 226 output_offsets: 0 output_offsets: 1 output_offsets: 2 encode_type: TypeCHBlock sql_mode: 0 time_zone_name: \"Asia/Shanghai\" collect_execution_summaries: false root_executor { tp: TypeTableScan tbl_scan { table_id: 2680 columns { column_id: 1 tp: 3 collation: -63 columnLen: 11 decimal: 0 flag: 4099 pk_handle: false array: false } columns { column_id: 2 tp: 3 collation: -63 columnLen: 11 decimal: 0 flag: 4099 pk_handle: false array: false } columns { column_id: 3 tp: 3 collation: -63 columnLen: 11 decimal: 0 flag: 4099 pk_handle: false array: false } desc: false next_read_engine: Local keep_order: false is_fast_scan: false max_wait_time_ms: 0 } executor_id: \"TableFullScan_43\" } force_encode_type: true"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [PhysicalPlan.cpp:84] ["tidb table scan has runtime filter size:0"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [PhysicalPlan.cpp:299] ["build unoptimized physical plan: \n<Projection, TableFullScan_43> | is_tidb_operator: false, schema: <TableFullScan_43_table_scan_0, Int32>, <TableFullScan_43_table_scan_1, Int32>, <TableFullScan_43_table_scan_2, Int32>\n <TableScan, TableFullScan_43> | is_tidb_operator: true, schema: <table_scan_0, Int32>, <table_scan_1, Int32>, <table_scan_2, Int32>\n"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [PhysicalPlan.cpp:302] ["build optimized physical plan: \n<Projection, TableFullScan_43> | is_tidb_operator: false, schema: <TableFullScan_43_table_scan_0, Int32>, <TableFullScan_43_table_scan_1, Int32>, <TableFullScan_43_table_scan_2, Int32>\n <TableScan, TableFullScan_43> | is_tidb_operator: true, schema: <table_scan_0, Int32>, <table_scan_1, Int32>, <table_scan_2, Int32>\n"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [DEBUG] [LearnerReadWorker.cpp:317] ["[Learner Read] Batch read index, num_regions=1 num_requests=1 num_stale_read=0 num_cached_index=0 num_unavailable=0 cost=0ms"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [INFO] [LearnerReadWorker.cpp:416] ["[Learner Read] Finish wait index and resolve locks, wait_cost=0ms n_regions=1 n_unavailable=1"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]
[2024/01/02 09:40:40.865 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1527]

But tiflash handle the raft log/admin log (region merge) correctly according to the logging.

2912960:[2024/01/02 09:19:47.911 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 84}"] [thread_id=1502]
2912983:[2024/01/02 09:20:02.779 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 84}"] [thread_id=1085]
2913023:[2024/01/02 09:20:03.461 +08:00] [INFO] [RaftCommands.cpp:152] ["[region_id=318] execute prepare merge, min_index 19, target region_id=5921"] [thread_id=62]
2913027:[2024/01/02 09:20:03.462 +08:00] [INFO] [RaftCommands.cpp:244] ["[region_id=5921 applied_term=7 applied_index=5630262] execute admin command CommitMerge at [term: 7, index: 5630263]"] [thread_id=61]
2913028:[2024/01/02 09:20:03.462 +08:00] [INFO] [RaftCommands.cpp:186] ["[region_id=5921] execute commit merge, source region_id=318, commit index=19"] [thread_id=61]
2913029:[2024/01/02 09:20:03.462 +08:00] [INFO] [RaftCommands.cpp:291] ["After execute merge cmd, current region info: [region_id=5921 index=5630263 table_id=2680 ver=85 conf_ver=35 state=Normal peer=id: 5933 store_id: 26 role: Learner]"] [thread_id=61]
2913032:[2024/01/02 09:20:03.462 +08:00] [INFO] [KVStore.cpp:393] ["Start to persist [region_id=5921 index=5630263 table_id=2680 ver=85 conf_ver=35 state=Normal peer=id: 5933 store_id: 26 role: Learner], cache size: 0 bytes for `admin raft cmd `"] [thread_id=61]
2913033:[2024/01/02 09:20:03.462 +08:00] [DEBUG] [KVStore.cpp:395] ["Persist [region_id=5921] done, cache size: 0 bytes"] [thread_id=61]
2913066:[2024/01/02 09:20:17.452 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1500]
2913077:[2024/01/02 09:20:23.481 +08:00] [DEBUG] [KVStore.cpp:557] ["[region_id=5921] approx mem cache info: rows 0, bytes 0, gap 1/200"] [thread_id=61]
2913108:[2024/01/02 09:20:31.789 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1085]
2913130:[2024/01/02 09:20:45.828 +08:00] [WARN] [CoprocessorHandler.cpp:204] ["LockException: region_id=5921, message: Key is locked (1 locks in regions {5921})"] [source="Coprocessor, is_remote_read: true, start_ts: 446734623143100416, region_info: {5921, 35, 85}"] [thread_id=1529]

So this should be an issue that client-c/tidb doesn't resolve locks correctly

JaySon-Huang commented 10 months ago

There are some long-running txn from the tidb's log:

>  rg 'primary_lock' tidb.log | head                                                                                                                                                                                                                                              2024-01-02 17:24:26
[2024/01/02 08:26:39.574 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734039555506176] [storeAddr=tc-tiflash-2.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=22] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\001\\254\\003\\200\\000\\000\\000\\000\\000\\000\\003\"\nlock_version: 446734038936584200\nkey: \"t\\200\\000\\000\\000\\000\\000\\nz_r\\200\\000\\000\\000\\002\\0010U\"\nlock_ttl: 20005\ntxn_size: 1\nlock_for_update_ts: 446734038936584200\nuse_async_commit: true\nmin_commit_ts: 446734038936584282\n, e.what() = DB::Exception,"]
[2024/01/02 08:26:39.584 +08:00] [INFO] [conn.go:1132] ["command dispatched failed"] [conn=3126854430] [session_alias=] [connInfo="id:3126854430, addr:10.200.72.180:55474 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="set @@tidb_read_staleness='-60'; select  c_last, c_id o_id, o_entry_d, o_ol_cnt, sum(ol_amount) from  customer, orders, order_line where  c_id = o_c_id   and c_w_id = o_w_id   and c_d_id = o_d_id   and ol_w_id = o_w_id   and ol_d_id = o_d_id   and ol_o_id = o_id group by o_id, o_w_id, o_d_id, c_id, c_last, o_entry_d, o_ol_cnt having  sum(ol_amount) > 200 order by sum(ol_amount) desc, o_entry_d "] [txn_mode=PESSIMISTIC] [timestamp=0] [err="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\001\\254\\003\\200\\000\\000\\000\\000\\000\\000\\003\"\nlock_version: 446734038936584200\nkey: \"t\\200\\000\\000\\000\\000\\000\\nz_r\\200\\000\\000\\000\\002\\0010U\"\nlock_ttl: 20005\ntxn_size: 1\nlock_for_update_ts: 446734038936584200\nuse_async_commit: true\nmin_commit_ts: 446734038936584282\n, e.what() = DB::Exception,\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleMPPStreamResponse\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:631\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).receiveResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:514\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleDispatchReq\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:467\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).dispatchAll.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:372\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]
[2024/01/02 08:27:00.882 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734044911370240] [storeAddr=tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=33] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\000\\316\\003\\200\\000\\000\\000\\000\\000\\000\\005\"\nlock_version: 446734038936584240\nkey: \"t\\200\\000\\000\\000\\000\\000\\n~_r\\200\\000\\000\\000\\001$zl\"\nlock_ttl: 20010\ntxn_size: 1\nlock_for_update_ts: 446734038936584240\nuse_async_commit: true\nmin_commit_ts: 446734038936584300\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:00.887 +08:00] [INFO] [conn.go:1132] ["command dispatched failed"] [conn=3126854430] [session_alias=] [connInfo="id:3126854430, addr:10.200.72.180:55474 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="set @@tidb_read_staleness='-60'; select  s_nationkey as supp_nation,            substr(c_state,1,1) as cust_nation,            extract(year from o_entry_d) as l_year,            sum(ol_amount) as revenue from  supplier, stock, order_line, orders, customer, nation n1, nation n2 where  ol_supply_w_id = s_w_id   and ol_i_id = s_i_id   and mod((s_w_id * s_i_id), 10000) = s_suppkey   and ol_w_id = o_w_id   and ol_d_id = o_d_id   and ol_o_id = o_id   and c_id = o_c_id   and c_w_id = o_w_id   and c_d_id = o_d_id   and s_nationkey = n1.n_nationkey   and ascii(substr(c_state,1,1)) - 65 = n2.n_nationkey   and (         (n1.n_name = 'JAPAN' and n2.n_name = 'CHINA')         or         (n1.n_name = 'CHINA' and n2.n_name = 'JAPAN')     )   and ol_delivery_d between '2007-01-02 00:00:00.000000' and '2032-01-02 00:00:00.000000' group by s_nationkey, substr(c_state,1,1), extract(year from o_entry_d) order by s_nationkey, cust_nation, l_year; "] [txn_mode=PESSIMISTIC] [timestamp=0] [err="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\000\\316\\003\\200\\000\\000\\000\\000\\000\\000\\005\"\nlock_version: 446734038936584240\nkey: \"t\\200\\000\\000\\000\\000\\000\\n~_r\\200\\000\\000\\000\\001$zl\"\nlock_ttl: 20010\ntxn_size: 1\nlock_for_update_ts: 446734038936584240\nuse_async_commit: true\nmin_commit_ts: 446734038936584300\n, e.what() = DB::Exception,\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleMPPStreamResponse\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:631\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).receiveResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:514\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleDispatchReq\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:467\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).dispatchAll.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:372\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]
[2024/01/02 08:27:21.774 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734050495823872] [storeAddr=tc-tiflash-2.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=5] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:21.774 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734050495823872] [storeAddr=tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=6] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:21.774 +08:00] [WARN] [local_mpp_coordinator.go:635] ["other error"] [txnStartTS=446734050495823872] [storeAddr=tc-tiflash-3.tc-tiflash-peer.endless-htap-stale-read-tps-5370694-1-713.svc:3930] [mpp-version=2] [task-id=8] [error="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,"]
[2024/01/02 08:27:21.775 +08:00] [INFO] [conn.go:1132] ["command dispatched failed"] [conn=3126854430] [session_alias=] [connInfo="id:3126854430, addr:10.200.72.180:55474 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="set @@tidb_read_staleness='-60'; select   ol_number,          sum(ol_quantity) as sum_qty,          sum(ol_amount) as sum_amount,          avg(ol_quantity) as avg_qty,          avg(ol_amount) as avg_amount,          count(*) as count_order from  order_line where  ol_delivery_d > '2007-01-02 00:00:00.000000' group by ol_number order by ol_number; "] [txn_mode=PESSIMISTIC] [timestamp=0] [err="other error for mpp stream: Code: 0, e.displayText() = DB::Exception: primary_lock: \"t\\200\\000\\000\\000\\000\\000\\nr_i\\200\\000\\000\\000\\000\\000\\000\\001\\003\\200\\000\\000\\000\\000\\000\\002\\373\\003\\200\\000\\000\\000\\000\\000\\000\\006\"\nlock_version: 446734038962798593\nkey: \"t\\200\\000\\000\\000\\000\\000\\n|_r\\200\\000\\000\\000\\023\\351\\036}\"\nlock_ttl: 20006\ntxn_size: 6\nlock_for_update_ts: 446734038962798593\nuse_async_commit: true\nmin_commit_ts: 446734038962798595\n, e.what() = DB::Exception,\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleMPPStreamResponse\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:631\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).receiveResults\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:514\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).handleDispatchReq\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:467\ngithub.com/pingcap/tidb/pkg/executor/internal/mpp.(*localMppCoordinator).dispatchAll.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/internal/mpp/local_mpp_coordinator.go:372\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"]

>  rg '446734038949691420' tidb.log
29855:[2024/01/02 09:03:19.235 +08:00] [WARN] [expensivequery.go:145] [expensive_query] [cost_time=2282.358999937s] [conn=3126854432] [user=root] [database=tpcc] [txn_start_ts=446734038949691420] [mem_max="16384 Bytes (16 KB)"] [sql=COMMIT] [session_alias=]
29873:[2024/01/02 09:03:24.734 +08:00] [INFO] [region_request.go:1532] ["throwing pseudo region error due to no replica available"] [conn=3126854432] [session_alias=] [req-ts=446734038949691420] [req-type=Prewrite] [region="{ region id: 6011, ver: 101, confVer: 21 }"] [replica-read-type=leader] [stale-read=false] [request-sender="{rpcError:wait recvLoop timeout,timeout:30s, wait_duration:1.187µs:: context deadline exceeded,replicaSelector: replicaSelector{selectorStateStr: accessKnownLeader, cacheRegionIsValid: false, replicaStatus: [peer: 6012, store: 1, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6013, store: 14, isEpochStale: false, attempts: 10, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6014, store: 5, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [retry-times=14] [total-backoff-ms=597704] [total-backoff-times=90] [max-exec-timeout-ms=20000] [total-region-errors="6013-deadline_exceeded:12, 6014-not_leader-6013:1, 6012-not_leader-6013:1"]
29874:[2024/01/02 09:03:24.863 +08:00] [WARN] [prewrite.go:291] ["slow prewrite request"] [startTS=446734038949691420] [region="{ region id: 6011, ver: 101, confVer: 21 }"] [attempts=8]
29922:[2024/01/02 09:03:26.868 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2310000]
29980:[2024/01/02 09:03:36.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2320000]
29994:[2024/01/02 09:03:46.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2330000]
30025:[2024/01/02 09:03:56.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2340000]
30034:[2024/01/02 09:04:06.867 +08:00] [INFO] [2pc.go:1215] ["send TxnHeartBeat"] [startTS=446734038949691420] [newTTL=2350000]
30039:[2024/01/02 09:04:10.911 +08:00] [INFO] [region_request.go:1532] ["send request on region error failed, err: region unavailable"] [conn=3126854432] [session_alias=] [req-ts=446734038949691420] [req-type=Prewrite] [region="{ region id: 6011, ver: 101, confVer: 21 }"] [replica-read-type=leader] [stale-read=false] [request-sender="{rpcError:wait recvLoop timeout,timeout:30s, wait_duration:1.187µs:: context deadline exceeded,replicaSelector: replicaSelector{selectorStateStr: accessKnownLeader, cacheRegionIsValid: true, replicaStatus: [peer: 6012, store: 1, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6013, store: 14, isEpochStale: false, attempts: 2, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 6014, store: 5, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [retry-times=1] [total-backoff-ms=603876] [total-backoff-times=92] [max-exec-timeout-ms=20000] [total-region-errors=6013-deadline_exceeded:2]
30040:[2024/01/02 09:04:10.911 +08:00] [ERROR] [2pc.go:1544] ["Async commit/1PC result undetermined"] [conn=3126854432] [session_alias=] [error="region unavailable"] [rpcErr="wait recvLoop timeout,timeout:30s, wait_duration:1.187µs:: context deadline exceeded"] [txnStartTS=446734038949691420]
gengliqi commented 10 months ago

The root cause is that one apply thread of one TiKV got stuck. So resolve lock rpc got stuck too. TiKV has fixed this issue https://github.com/tikv/tikv/issues/16340.