pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0
948 stars 410 forks source link

GRpc No match found for server name cause TiFlash crash #8535

Open aytrack opened 11 months ago

aytrack commented 11 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

[2023/12/13 03:09:58.571 +00:00] [ERROR] [MPPTask.cpp:469] ["task running meets error: Code: 49, e.displayText() = DB::Exception: Check tunnel != nullptr failed: tunnel4+10: Can't find task [446283641820020850,4] within 10s., e.what() = DB::Exception, Stack trace:\n\n\n       0x1c54d3e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29707582]\n                \tdbms/src/Common/Exception.h:46\n       0x82636b6\tDB::GRPCReceiverContext::establishMPPConnectionLocalV2(DB::ExchangeRecvRequest const&, unsigned long, DB::LocalRequestHandler&, bool, bool) [tiflash+136722102]\n                \tdbms/src/Flash/Mpp/GRPCReceiverContext.cpp:296\n       0x1c14f84\tDB::ExchangeReceiverBase<DB::GRPCReceiverContext>::ExchangeReceiverBase(std::__1::shared_ptr<DB::GRPCReceiverContext>, unsigned long, unsigned long, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, int, std::__1::vector<std::__1::tuple<std::__1::shared_ptr<mpp::DispatchTaskRequest>, std::__1::vector<pingcap::kv::RegionVerID, std::__1::allocator<pingcap::kv::RegionVerID> >, unsigned long>, std::__1::allocator<std::__1::tuple<std::__1::shared_ptr<mpp::DispatchTaskRequest>, std::__1::vector<pingcap::kv::RegionVerID, std::__1::allocator<pingcap::kv::RegionVerID> >, unsigned long> > > const&) [tiflash+29446020]\n                \tdbms/src/Flash/Mpp/ExchangeReceiver.cpp:338\n       0x827e991\tstd::__1::__function::__func<DB::MPPTask::initExchangeReceivers()::$_18, std::__1::allocator<DB::MPPTask::initExchangeReceivers()::$_18>, bool (tipb::Executor const&)>::operator()(tipb::Executor const&) [tiflash+136833425]\n                \t/usr/local/bin/../include/c++/v1/__functional/function.h:345\n       0x8189682\tvoid DB::traverseExecutorTree<std::__1::function<bool (tipb::Executor const&)>&>(tipb::Executor const&, std::__1::function<bool (tipb::Executor const&)>&)::'lambda'(tipb::Executor const&)::operator()(tipb::Executor const&) const [tiflash+135829122]\n                \tdbms/src/Flash/Statistics/traverseExecutors.h:79\n       0x8189186\tvoid DB::traverseExecutors<std::__1::function<bool (tipb::Executor const&)> >(tipb::DAGRequest const*, std::__1::function<bool (tipb::Executor const&)>&&) [tiflash+135827846]\n                \tdbms/src/Flash/Statistics/traverseExecutors.h:94\n       0x8279898\tDB::MPPTask::runImpl() [tiflash+136812696]\n                \tdbms/src/Flash/Mpp/MPPTask.cpp:389\n       0x1d60168\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+30802280]\n                \tdbms/src/Common/wrapInvocable.h:36\n       0x1c64c53\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&) [tiflash+29772883]\n                \tdbms/src/Common/DynamicThreadPool.cpp:101\n       0x1c642b4\tDB::DynamicThreadPool::fixedWork(unsigned long) [tiflash+29770420]\n                \tdbms/src/Common/DynamicThreadPool.cpp:115\n       0x1c657c2\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(unsigned long), DB::DynamicThreadPool*, unsigned long&>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, void (DB::DynamicThreadPool::*&&)(unsigned long), DB::DynamicThreadPool*&&, unsigned long&)::'lambda'(auto&&...), DB::DynamicThreadPool*, unsigned long> >(void*) [tiflash+29775810]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]\n  0x7f1b09cfa450\t__GI___clone3 [libc.so.6+259152]"] [source="MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>"] [thread_id=5707]
[2023/12/13 03:09:58.571 +00:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>, abort type: ONERROR, reason: From MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>: Code: 49, e.displayText() = DB::Exception: Check tunnel != nullptr failed: tunnel4+10: Can't find task [446283641820020850,4] within 10s., e.what() = DB::Exception,"] [thread_id=5707]
[2023/12/13 03:09:58.571 +00:00] [WARN] [MPPTask.cpp:511] ["Begin abort task: MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>, abort type: ONERROR"] [source="MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>"] [thread_id=5707]
[2023/12/13 03:10:47.239 +00:00] [ERROR] [MPPTask.cpp:469] ["task running meets error: Code: 0, e.displayText() = DB::Exception: tunnel4+-1 is timeout, e.what() = DB::Exception, Stack trace:\n\n\n       0x1c54d3e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29707582]\n                \tdbms/src/Common/Exception.h:46\n       0x829220e\tDB::MPPTunnel::waitUntilConnectedOrFinished(std::__1::unique_lock<std::__1::mutex>&) [tiflash+136913422]\n                \tdbms/src/Flash/Mpp/MPPTunnel.cpp:343\n       0x8291918\tDB::MPPTunnel::write(std::__1::shared_ptr<DB::TrackedMppDataPacket>&&) [tiflash+136911128]\n                \tdbms/src/Flash/Mpp/MPPTunnel.cpp:163\n       0x1c2406b\tDB::MPPTunnelSetBase<DB::MPPTunnel>::sendExecutionSummary(tipb::SelectResponse const&) [tiflash+29507691]\n                \tdbms/src/Flash/Mpp/MPPTunnelSet.cpp:39\n       0x827af34\tDB::MPPTask::runImpl() [tiflash+136818484]\n                \tdbms/src/Flash/Mpp/MPPTask.cpp:421\n       0x1d60168\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+30802280]\n                \tdbms/src/Common/wrapInvocable.h:36\n       0x1c64c53\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&) [tiflash+29772883]\n                \tdbms/src/Common/DynamicThreadPool.cpp:101\n       0x1c642b4\tDB::DynamicThreadPool::fixedWork(unsigned long) [tiflash+29770420]\n                \tdbms/src/Common/DynamicThreadPool.cpp:115\n       0x1c657c2\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(unsigned long), DB::DynamicThreadPool*, unsigned long&>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, void (DB::DynamicThreadPool::*&&)(unsigned long), DB::DynamicThreadPool*&&, unsigned long&)::'lambda'(auto&&...), DB::DynamicThreadPool*, unsigned long> >(void*) [tiflash+29775810]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]\n  0x7f1b09cfa450\t__GI___clone3 [libc.so.6+259152]"] [source="MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>"] [thread_id=6468]
[2023/12/13 03:10:47.239 +00:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>, abort type: ONERROR, reason: From MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>: Code: 0, e.displayText() = DB::Exception: tunnel4+-1 is timeout, e.what() = DB::Exception,"] [thread_id=6468]
[2023/12/13 03:10:47.239 +00:00] [WARN] [MPPTask.cpp:511] ["Begin abort task: MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>, abort type: ONERROR"] [source="MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>"] [thread_id=6468]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5804) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x7f1970bfbb"] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:413] ["Access: read."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.600 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: tc-tiflash-10.tc-tiflash-peer.comp-tiflash-tps-5131024-1-679.svc."] [source=grpc] [thread_id=5852]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:569] ["\n       0x7554c51\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+123030609]\n                \tlibs/libdaemon/src/BaseDaemon.cpp:220\n  0x7f1b09d0fdb0\t<unknown symbol> [libc.so.6+347568]\n       0x8ccee29\tbio_ctrl [tiflash+147648041]\n                \tcontrib/boringssl/crypto/bio/pair.c:411\n       0x8ccd3f5\tBIO_pending [tiflash+147641333]\n                \tcontrib/boringssl/crypto/bio/bio.c:312\n       0x8c586ac\tssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+147162796]\n                \tcontrib/grpc/src/core/tsi/ssl_transport_security.cc:1568\n       0x8b0568b\tgrpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+145774219]\n                \tcontrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463\n       0x8aa3141\tpollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+145371457]\n                \tcontrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139\n       0x8b20fae\tcq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+145887150]\n                \tcontrib/grpc/src/core/lib/surface/completion_queue.cc:1047\n       0x1e8af99\tDB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+32026521]\n                \tdbms/src/Server/FlashGrpcServerHolder.cpp:52\n       0x1e8ab7d\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+32025469]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]"] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5852) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:413] ["Access: read."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.609 +00:00] [ERROR] [BaseDaemon.cpp:569] ["\n       0x7554c51\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+123030609]\n                \tlibs/libdaemon/src/BaseDaemon.cpp:220\n  0x7f1b09d0fdb0\t<unknown symbol> [libc.so.6+347568]\n       0x8c57503\tserver_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+147158275]\n                \tcontrib/grpc/src/core/tsi/ssl_transport_security.cc:1884\n       0x8c803ae\tbssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+147325870]\n                \tcontrib/boringssl/ssl/extensions.cc:1570\n       0x8cbf426\tbssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+147584038]\n                \tcontrib/boringssl/ssl/tls13_server.cc:1242\n       0x8c9e8dc\tbssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+147450076]\n                \tcontrib/boringssl/ssl/handshake_server.cc:1835\n       0x8c9018a\tbssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+147390858]\n                \tcontrib/boringssl/ssl/handshake.cc:738\n       0x8caae2a\tSSL_do_handshake [tiflash+147500586]\n                \tcontrib/boringssl/ssl/ssl_lib.cc:841\n       0x8c58680\tssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+147162752]\n                \tcontrib/grpc/src/core/tsi/ssl_transport_security.cc:1568\n       0x8b0568b\tgrpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+145774219]\n                \tcontrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463\n       0x8aa3141\tpollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+145371457]\n                \tcontrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139\n       0x8b20fae\tcq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+145887150]\n                \tcontrib/grpc/src/core/lib/surface/completion_queue.cc:1047\n       0x1e8af99\tDB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+32026521]\n                \tdbms/src/Server/FlashGrpcServerHolder.cpp:52\n       0x1e8c05d\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+32030813]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]"] [source=BaseDaemon] [thread_id=6514]

similar case #7024 updated: not similar

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

v7.1.3

gengliqi commented 8 months ago

Another same case from one customer

[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 4908) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.234 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.254 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
       0x68b831c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
                    libs/libdaemon/src/BaseDaemon.cpp:220
  0xffffbbe3881c    <unknown symbol> [linux-vdso.so.1+2076]
       0x7d59d20    server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131439904]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
       0x7d80490    bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131597456]
                    contrib/boringssl/ssl/extensions.cc:1570
       0x7dbae28    bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131837480]
                    contrib/boringssl/ssl/tls13_server.cc:1242
       0x7d9d050    bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131715152]
                    contrib/boringssl/ssl/handshake_server.cc:1835
       0x7d90104    bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131662084]
                    contrib/boringssl/ssl/handshake.cc:738
       0x7da810c    SSL_do_handshake [tiflash+131760396]
                    contrib/boringssl/ssl/ssl_lib.cc:841
       0x7d5ae5c    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444316]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7c21db0    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7bbf500    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7c3afb8    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x1b0e8c0    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b0f6fc    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28374780]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffffb7da6a28    start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=258943]
JaySon-Huang commented 8 months ago

Seems this is an issue with the ssl lib. And we can't locate the root cause of it so far.

JaySon-Huang commented 8 months ago

Another similar case observed

[2024/03/26 16:51:20.125 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-3.db-tiflash-peer.tidb1379xxx5254.svc."] [source=grpc] [thread_id=5588]
[2024/03/26 16:51:20.125 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5588) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=11240]
  0xffff94bc0a28    start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=11240]
                    /usr/local/bin/../include/c++/v1/thread:291
       0x1b0f6fc    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28374780]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b0e8c0    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x7c3afb8    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7bbf500    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7c21db0    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7d5ae5c    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444316]
                    contrib/boringssl/ssl/ssl_lib.cc:841
       0x7da810c    SSL_do_handshake [tiflash+131760396]
                    contrib/boringssl/ssl/handshake.cc:738
       0x7d90104    bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131662084]
                    contrib/boringssl/ssl/handshake_server.cc:1835
       0x7d9d050    bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131715152]
                    contrib/boringssl/ssl/tls13_server.cc:1242
       0x7dbae28    bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131837480]
                    contrib/boringssl/ssl/extensions.cc:1570
       0x7d80490    bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131597456]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
       0x7d59d20    server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131439904]
  0xffff98c52800    <unknown symbol> [linux-vdso.so.1+2048]
                    libs/libdaemon/src/BaseDaemon.cpp:220
       0x68b831c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5322) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0xfff52c58f2"] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=11240]
  0xffff94bc0a28    start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=11240]
                    /usr/local/bin/../include/c++/v1/thread:291
       0x1b0e444    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+28369988]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b0e8c0    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x7c3afb8    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7bbf500    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7c21db0    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1576
       0x7d5aee8    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444456]
                    contrib/boringssl/crypto/bio/pair.c:158
       0x7dc976c    bio_read [tiflash+131897196]
  0xffff98c52800    <unknown symbol> [linux-vdso.so.1+2048]
                    libs/libdaemon/src/BaseDaemon.cpp:220
       0x68b831c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
[2024/03/26 16:51:20.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
JaySon-Huang commented 6 months ago

Each time we can see a similar logging like .../grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-0.db-tiflash-peer.tidb?????????.svc."]

[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
       0x68c2c3c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109849660]
                    libs/libdaemon/src/BaseDaemon.cpp:220
  0xffffb0044820    <unknown symbol> [linux-vdso.so.1+2080]
       0x7de1824    bio_ctrl [tiflash+131995684]
                    contrib/boringssl/crypto/bio/pair.c:411
       0x7ddfb3c    BIO_pending [tiflash+131988284]
                    contrib/boringssl/crypto/bio/bio.c:312
       0x7d72d94    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131542420]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7c39cbc    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130260156]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7bd740c    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129856524]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7c52ec4    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130363076]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x1b14598    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28394904]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b153d4    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28398548]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffffabfb2a38    start_thread [libc.so.6+535096]"] [source=BaseDaemon] [thread_id=299017]Show context
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0xffdf201318"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 2417) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
       0x68c2c3c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109849660]
                    libs/libdaemon/src/BaseDaemon.cpp:220
  0xffffb0044820    <unknown symbol> [linux-vdso.so.1+2080]
       0x7d71c30    server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131537968]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
       0x7d983a0    bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131695520]
                    contrib/boringssl/ssl/extensions.cc:1570
       0x7dd2d38    bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131935544]
                    contrib/boringssl/ssl/tls13_server.cc:1242
       0x7db4f60    bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131813216]
                    contrib/boringssl/ssl/handshake_server.cc:1835
       0x7da8014    bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131760148]
                    contrib/boringssl/ssl/handshake.cc:738
       0x7dc001c    SSL_do_handshake [tiflash+131858460]
                    contrib/boringssl/ssl/ssl_lib.cc:841
       0x7d72d6c    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131542380]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7c39cbc    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130260156]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7bd740c    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129856524]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7c52ec4    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130363076]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x1b14598    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28394904]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b1411c    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+28393756]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffffabfb2a38    start_thread [libc.so.6+535096]"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 2787) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-0.db-tiflash-peer.tidb?????????.svc."] [source=grpc] [thread_id=2787]
windtalker commented 6 months ago

The issue is introduced by #6346, and the root cause is there is data race in grpc-core when update ssh cert online: https://github.com/grpc/grpc/pull/22647#pullrequestreview-391930235

gengliqi commented 6 months ago

The bug has been located in gRPC core. See https://github.com/grpc/grpc/issues/36693.

windtalker commented 1 day ago

As comment in https://github.com/pingcap/tiflash/issues/8535#issuecomment-2123931913, the root cause of this issue is the data race in grpc. Before https://github.com/pingcap/tiflash/pull/9071, TiFlash will reload ssl cert every 2 seconds, even if the ssl cert is not updated, so TiFlash may meet this issue with a certain probability, and after https://github.com/pingcap/tiflash/pull/9071, TiFlash only reload ssl cert if they are actually updated. Since update ssl cert is usually a very low frequency events(maybe just once a year), the issue can be considered as "99.99% fixed", but there is still a very low chance of being triggered. So I left this issue as not fixed and change the severity to minor.