pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

Async tunnel received establish connection request twice #8492

Open yibin87 opened 7 months ago

yibin87 commented 7 months ago

Logs recorded from a regression test, commit hash 308541405f4e6f78cee2ff0b05de2209cc168000:

[2023/12/07 16:54:39.275 +08:00] [DEBUG] [MPPTask.cpp:240] ["begin to register the tunnel tunnel11+13, is_local: false, is_async: true"] [source="MPP<gather_id:1, query_ts:1701939279255873937, local_query_id:9887, server_id:402, start_ts:446153170416566277,task_id:11>"] [thread_id=543]

[2023/12/07 16:54:39.306 +08:00] [INFO] [FlashService.cpp:599] ["Handling establish mpp connection request: sender_meta {\n start_ts: 446153170416566277\n task_id: 11\n address: \"tc-tiflash-0.tc-tiflash-peer.endless-htap-ch-feature-matrix-tps-5040417-1-792.svc:3930\"\n gather_id: 1\n query_ts: 1701939279255873937\n local_query_id: 9887\n server_id: 402\n mpp_version: 2\n connection_id: 843062682\n}\nreceiver_meta {\n start_ts: 446153170416566277\n task_id: 13\n address: \"tc-tiflash-1.tc-tiflash-peer.endless-htap-ch-feature-matrix-tps-5040417-1-792.svc:3930\"\n gather_id: 1\n query_ts: 1701939279255873937\n local_query_id: 9887\n server_id: 402\n mpp_version: 2\n keyspace_id: 4294967295\n coordinator_address: \"tc-tidb-1.tc-tidb-peer.endless-htap-ch-feature-matrix-tps-5040417-1-792.svc:10080\"\n resource_group_name: \"default\"\n connection_id: 843062682\n}\n"] [source=FlashService] [thread_id=511]

[2023/12/07 16:54:39.308 +08:00] [DEBUG] [MPPTunnel.cpp:323] ["Async tunnel connected in async mode"] [source="MPP<gather_id:1, query_ts:1701939279255873937, local_query_id:9887, server_id:402, start_ts:446153170416566277,task_id:11> tunnel11+13"] [thread_id=702]

[2023/12/07 16:54:42.892 +08:00] [INFO] [EstablishCall.cpp:250] ["async connection for tunnel11+13 cost 3586 ms, including 0 ms to wait task."] [source="MPP<gather_id:1, query_ts:1701939279255873937, local_query_id:9887, server_id:402, start_ts:446153170416566277,task_id:11> tunnel11+13"] [thread_id=721]

[2023/12/07 16:54:42.892 +08:00] [DEBUG] [MPPTask.cpp:585] ["finish with 3.54297196 seconds, 10003 rows, 1 blocks, 470141 bytes"] [source="MPP<gather_id:1, query_ts:1701939279255873937, local_query_id:9887, server_id:402, start_ts:446153170416566277,task_id:11>"] [thread_id=701]

[2023/12/07 16:54:43.879 +08:00] [INFO] [FlashService.cpp:599] ["Handling establish mpp connection request: sender_meta {\n start_ts: 446153170416566277\n task_id: 11\n address: \"tc-tiflash-0.tc-tiflash-peer.endless-htap-ch-feature-matrix-tps-5040417-1-792.svc:3930\"\n gather_id: 1\n query_ts: 1701939279255873937\n local_query_id: 9887\n server_id: 402\n mpp_version: 2\n connection_id: 843062682\n}\nreceiver_meta {\n start_ts: 446153170416566277\n task_id: 13\n address: \"tc-tiflash-1.tc-tiflash-peer.endless-htap-ch-feature-matrix-tps-5040417-1-792.svc:3930\"\n gather_id: 1\n query_ts: 1701939279255873937\n local_query_id: 9887\n server_id: 402\n mpp_version: 2\n keyspace_id: 4294967295\n coordinator_address: \"tc-tidb-1.tc-tidb-peer.endless-htap-ch-feature-matrix-tps-5040417-1-792.svc:10080\"\n resource_group_name: \"default\"\n connection_id: 843062682\n}\n"] [source=FlashService] [thread_id=722]

[2023/12/07 16:54:43.879 +08:00] [WARN] [EstablishCall.cpp:273] ["EstablishCallData finishes without connected, time cost 0ms, query id: <query_ts:1701939279255873937, local_query_id:9887, server_id:402, start_ts:446153170416566277>, connection id: tunnel11+13"] [source=EstablishCallData] [thread_id=723]

[2023/12/07 16:54:46.840 +08:00] [ERROR] [MPPTask.cpp:636] ["task running meets error: Code: 0, e.displayText() = DB::Exception: write to tunnel tunnel11+13 which is already closed, tunnel11+13: unexpectedWriteDone called, e.what() = DB::Exception, Stack trace:\n\n\n 0x1ee65fe\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) [tiflash+32400894]\n \tdbms/src/Common/Exception.h:46\n 0x88ef47d\tDB::MPPTunnel::write(std::__1::shared_ptr&&) [tiflash+143586429]\n \tdbms/src/Flash/Mpp/MPPTunnel.cpp:181\n 0x1ea1d9b\tDB::MPPTunnelSetBase::sendExecutionSummary(tipb::SelectResponse const&) [tiflash+32120219]\n \tdbms/src/Flash/Mpp/MPPTunnelSet.cpp:39\n 0x88cf9f6\tDB::MPPTask::runImpl() [tiflash+143456758]\n \tdbms/src/Flash/Mpp/MPPTask.cpp:597\n 0x20011c8\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+33558984]\n \tdbms/src/Common/wrapInvocable.h:36\n 0x1ef6273\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete >&) [tiflash+32465523]\n \tdbms/src/Common/DynamicThreadPool.cpp:124\n 0x1ef5c01\tDB::DynamicThreadPool::fixedWork(unsigned long) [tiflash+32463873]\n \tdbms/src/Common/DynamicThreadPool.cpp:138\n 0x1ef78e2\tvoid std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::)(unsigned long), DB::DynamicThreadPool, unsigned long&>(bool, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, void (DB::DynamicThreadPool::&&)(unsigned long), DB::DynamicThreadPool&&, unsigned long&)::'lambda'(auto&&...), DB::DynamicThreadPool, unsigned long> >(void*) [tiflash+32471266]\n \t/usr/local/bin/../include/c++/v1/thread:291\n 0x7f879ff42802\tstart_thread [libc.so.6+653314]\n 0x7f879fee2450\t__clone3 [libc.so.6+259152]"] [source="MPP<gather_id:1, query_ts:1701939279255873937, local_query_id:9887, server_id:402, start_ts:446153170416566277,task_id:11>"] [thread_id=701]

yibin87 commented 7 months ago

Task 13 log: [2023/12/07 16:54:42.875 +08:00] [WARN] [AsyncRequestHandler.h:238] ["Finish fail. err code: UNAVAILABLE, err msg: Connection reset by peer, retry time 0"] [source="MPP<gather_id:1, query_ts:1701939279255873937, local_query_id:9887, server_id:402, start_ts:446153170416566277,task_id:13> ExchangeReceiver_168 async tunnel11+13"] [thread_id=578]

yibin87 commented 7 months ago

The issue is triggered when all of the following conditions are met:

  1. The exchange receiver successfully established a connection to the sender
  2. The receiver received no data from the sender over that connection
  3. The connection then failed (here with "Connection reset by peer"), so the receiver retried and the sender received the establish connection request a second time