Open aytrack opened 11 months ago
Another occurrence of the same crash, from one customer:
[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 4908) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.234 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.254 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
0x68b831c faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
libs/libdaemon/src/BaseDaemon.cpp:220
0xffffbbe3881c <unknown symbol> [linux-vdso.so.1+2076]
0x7d59d20 server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131439904]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
0x7d80490 bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131597456]
contrib/boringssl/ssl/extensions.cc:1570
0x7dbae28 bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131837480]
contrib/boringssl/ssl/tls13_server.cc:1242
0x7d9d050 bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131715152]
contrib/boringssl/ssl/handshake_server.cc:1835
0x7d90104 bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131662084]
contrib/boringssl/ssl/handshake.cc:738
0x7da810c SSL_do_handshake [tiflash+131760396]
contrib/boringssl/ssl/ssl_lib.cc:841
0x7d5ae5c ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444316]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
0x7c21db0 grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
0x7bbf500 pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
0x7c3afb8 cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
0x1b0e8c0 DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
dbms/src/Server/FlashGrpcServerHolder.cpp:52
0x1b0f6fc void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28374780]
/usr/local/bin/../include/c++/v1/thread:291
0xffffb7da6a28 start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=258943]
This seems to be an issue in the SSL library. We have not been able to locate the root cause so far.
Another similar case observed
[2024/03/26 16:51:20.125 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-3.db-tiflash-peer.tidb1379xxx5254.svc."] [source=grpc] [thread_id=5588]
[2024/03/26 16:51:20.125 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5588) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=11240]
0xffff94bc0a28 start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=11240]
/usr/local/bin/../include/c++/v1/thread:291
0x1b0f6fc void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28374780]
dbms/src/Server/FlashGrpcServerHolder.cpp:52
0x1b0e8c0 DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
0x7c3afb8 cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
0x7bbf500 pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
0x7c21db0 grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
0x7d5ae5c ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444316]
contrib/boringssl/ssl/ssl_lib.cc:841
0x7da810c SSL_do_handshake [tiflash+131760396]
contrib/boringssl/ssl/handshake.cc:738
0x7d90104 bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131662084]
contrib/boringssl/ssl/handshake_server.cc:1835
0x7d9d050 bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131715152]
contrib/boringssl/ssl/tls13_server.cc:1242
0x7dbae28 bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131837480]
contrib/boringssl/ssl/extensions.cc:1570
0x7d80490 bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131597456]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
0x7d59d20 server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131439904]
0xffff98c52800 <unknown symbol> [linux-vdso.so.1+2048]
libs/libdaemon/src/BaseDaemon.cpp:220
0x68b831c faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5322) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0xfff52c58f2"] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=11240]
0xffff94bc0a28 start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=11240]
/usr/local/bin/../include/c++/v1/thread:291
0x1b0e444 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+28369988]
dbms/src/Server/FlashGrpcServerHolder.cpp:52
0x1b0e8c0 DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
0x7c3afb8 cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
0x7bbf500 pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
0x7c21db0 grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1576
0x7d5aee8 ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444456]
contrib/boringssl/crypto/bio/pair.c:158
0x7dc976c bio_read [tiflash+131897196]
0xffff98c52800 <unknown symbol> [linux-vdso.so.1+2048]
libs/libdaemon/src/BaseDaemon.cpp:220
0x68b831c faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
[2024/03/26 16:51:20.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
Each time, we see a similar log line, such as .../grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-0.db-tiflash-peer.tidb?????????.svc."]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
0x68c2c3c faultSignalHandler(int, siginfo_t*, void*) [tiflash+109849660]
libs/libdaemon/src/BaseDaemon.cpp:220
0xffffb0044820 <unknown symbol> [linux-vdso.so.1+2080]
0x7de1824 bio_ctrl [tiflash+131995684]
contrib/boringssl/crypto/bio/pair.c:411
0x7ddfb3c BIO_pending [tiflash+131988284]
contrib/boringssl/crypto/bio/bio.c:312
0x7d72d94 ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131542420]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
0x7c39cbc grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130260156]
contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
0x7bd740c pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129856524]
contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
0x7c52ec4 cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130363076]
contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
0x1b14598 DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28394904]
dbms/src/Server/FlashGrpcServerHolder.cpp:52
0x1b153d4 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28398548]
/usr/local/bin/../include/c++/v1/thread:291
0xffffabfb2a38 start_thread [libc.so.6+535096]"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0xffdf201318"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 2417) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
0x68c2c3c faultSignalHandler(int, siginfo_t*, void*) [tiflash+109849660]
libs/libdaemon/src/BaseDaemon.cpp:220
0xffffb0044820 <unknown symbol> [linux-vdso.so.1+2080]
0x7d71c30 server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131537968]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
0x7d983a0 bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131695520]
contrib/boringssl/ssl/extensions.cc:1570
0x7dd2d38 bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131935544]
contrib/boringssl/ssl/tls13_server.cc:1242
0x7db4f60 bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131813216]
contrib/boringssl/ssl/handshake_server.cc:1835
0x7da8014 bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131760148]
contrib/boringssl/ssl/handshake.cc:738
0x7dc001c SSL_do_handshake [tiflash+131858460]
contrib/boringssl/ssl/ssl_lib.cc:841
0x7d72d6c ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131542380]
contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
0x7c39cbc grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130260156]
contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
0x7bd740c pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129856524]
contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
0x7c52ec4 cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130363076]
contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
0x1b14598 DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28394904]
dbms/src/Server/FlashGrpcServerHolder.cpp:52
0x1b1411c void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+28393756]
/usr/local/bin/../include/c++/v1/thread:291
0xffffabfb2a38 start_thread [libc.so.6+535096]"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 2787) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-0.db-tiflash-peer.tidb?????????.svc."] [source=grpc] [thread_id=2787]
The issue was introduced by #6346. The root cause is a data race in gRPC core when the SSL certificate is updated online: https://github.com/grpc/grpc/pull/22647#pullrequestreview-391930235
The bug has been located in gRPC core. See https://github.com/grpc/grpc/issues/36693.
As noted in https://github.com/pingcap/tiflash/issues/8535#issuecomment-2123931913, the root cause of this issue is a data race in gRPC. Before https://github.com/pingcap/tiflash/pull/9071, TiFlash reloaded the SSL certificate every 2 seconds even if it had not changed, so TiFlash could hit this issue with some probability. After https://github.com/pingcap/tiflash/pull/9071, TiFlash reloads the SSL certificate only when it has actually been updated. Since updating the SSL certificate is usually a very low-frequency event (maybe just once a year), the issue can be considered "99.99% fixed", but there is still a very small chance of triggering it. So I am leaving this issue as not fixed and changing the severity to minor.
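To illustrate the idea of reloading only on an actual change, here is a minimal C++ sketch. It is not the code from the TiFlash PR and does not use any gRPC API: `CertMaterial`, `ChangeDetectingReloader`, and the `load`/`apply` callbacks are hypothetical names introduced for illustration. The sketch assumes that the racy path is the "apply new certificate" step, so skipping it when nothing changed lowers the chance of hitting the race but does not remove the race itself.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical container for what is read from disk on each poll.
struct CertMaterial
{
    std::string cert_pem;
    std::string key_pem;

    bool operator==(const CertMaterial & other) const
    {
        return cert_pem == other.cert_pem && key_pem == other.key_pem;
    }
};

// Hypothetical helper: polls for certificate material and applies it
// only when it differs from what was applied last time.
class ChangeDetectingReloader
{
public:
    ChangeDetectingReloader(
        std::function<CertMaterial()> load,
        std::function<void(const CertMaterial &)> apply)
        : load_(std::move(load))
        , apply_(std::move(apply))
    {
        assert(load_ && apply_);
    }

    // Intended to be called periodically (for example every 2 seconds).
    // Returns true only when new material was handed to apply().
    bool poll()
    {
        CertMaterial latest = load_();
        if (has_current_ && latest == current_)
            return false; // unchanged: skip the reload path entirely

        apply_(latest); // assumed to be the step that can race
        current_ = std::move(latest);
        has_current_ = true;
        return true;
    }

private:
    std::function<CertMaterial()> load_;
    std::function<void(const CertMaterial &)> apply_;
    CertMaterial current_;
    bool has_current_ = false;
};
```

With this shape, a poll loop that runs every 2 seconds calls `apply` once at startup and then only after the certificate or key content changes, which matches the "very low frequency" reasoning above.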
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
similar case #7024
updated: not similar
2. What did you expect to see? (Required)
3. What did you see instead (Required)
4. What is your TiFlash version? (Required)
v7.1.3