Open AkiraXie opened 1 year ago
/severity critical
@AkiraXie Is TLS enabled?
This is a crash inside grpc-core. We have hit random grpc-core crashes before, such as https://github.com/pingcap/tiflash/issues/5722. Since the probability of triggering this kind of problem is very low, I would like to change the severity from critical to major.
@AkiraXie Is TLS enabled?
No, TLS is not enabled.
Reproduced in an HA testing environment under the disaggregated architecture: a compute node crashed with a similar stack. All write nodes had been stopped, so there were no alive write node stores.
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.002552253Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.494 +08:00] [ERROR] [MPPTask.cpp:647] [\"task running meets error: Poco::Exception. Code: 1000, e.code() = 15, e.displayText() = Exception: no alive tiflash, cannot dispatch BatchCopTask, e.what() = Exception\"] [source=\"MPP<gather_id:<gather_id:1, query_ts:1713948667421748951, local_query_id:2980, server_id:1464686, start_ts:449301359464022031, resource_group: default>,task_id:2>\"] [thread_id=449]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.002358528Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.485 +08:00] [ERROR] [MPPTask.cpp:647] [\"task running meets error: Poco::Exception. Code: 1000, e.code() = 15, e.displayText() = Exception: no alive tiflash, cannot dispatch BatchCopTask, e.what() = Exception\"] [source=\"MPP<gather_id:<gather_id:1, query_ts:1713948667441032954, local_query_id:3362, server_id:1914233, start_ts:449301359477129218, resource_group: default>,task_id:2>\"] [thread_id=141]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001578854Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:563] [\"
0x7779901 faultSignalHandler(int, siginfo_t*, void*) [tiflash+125278465]
libs/libdaemon/src/BaseDaemon.cpp:214
0x7f3e1ff55630 <unknown symbol> [libpthread.so.0+63024]
0x93c36e6 cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+154941158]
contrib/grpc/src/core/lib/surface/completion_queue.cc:999
0x21258a9 DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+34756777]
dbms/src/Server/FlashGrpcServerHolder.cpp:50
0x212549d void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7&&)::'lambda'(auto&&...)> >(void*) [tiflash+34755741]
/usr/local/bin/../include/c++/v1/thread:291
0x7f3e1ff4dea5 start_thread [libpthread.so.0+32421]\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001575793Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:419] [\"Unknown si_code.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001573143Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:407] [\"Access: read.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001570489Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:399] [\"Address: NULL pointer.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001567602Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:371] [\"(from thread 571) Received signal Segmentation fault(11).\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001564733Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:370] [########################################] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001553575Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:563] [\"
0x7779901 faultSignalHandler(int, siginfo_t*, void*) [tiflash+125278465]
libs/libdaemon/src/BaseDaemon.cpp:214
0x7f3e1ff55630 <unknown symbol> [libpthread.so.0+63024]
0x93c3ab3 cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+154942131]
contrib/grpc/src/core/lib/surface/completion_queue.cc:1005
0x21258a9 DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+34756777]
dbms/src/Server/FlashGrpcServerHolder.cpp:50
0x212549d void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7&&)::'lambda'(auto&&...)> >(void*) [tiflash+34755741]
/usr/local/bin/../include/c++/v1/thread:291
0x7f3e1ff4dea5 start_thread [libpthread.so.0+32421]\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001550288Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:419] [\"Unknown si_code.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001547642Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:407] [\"Access: read.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001544919Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:399] [\"Address: NULL pointer.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001542085Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:371] [\"(from thread 501) Received signal Segmentation fault(11).\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.00153932Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:370] [########################################] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.000755732Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.437 +08:00] [ERROR] [MPPTask.cpp:647] [\"task running meets error: Poco::Exception. Code: 1000, e.code() = 15, e.displayText() = Exception: no alive tiflash, cannot dispatch BatchCopTask, e.what() = Exception\"] [source=\"MPP<gather_id:<gather_id:1, query_ts:1713948667405980939, local_query_id:3361, server_id:1914233, start_ts:449301359464022018, resource_group: default>,task_id:2>\"] [thread_id=129]\n"}
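The excerpt above is listed newest-first, and each TiFlash entry is wrapped in a container-log JSON object. For anyone who wants to read it in the order events happened, here is a minimal sketch that unwraps the records and sorts them by the wrapper's `time` field. The function names (`parse_records`, `time_key`, `chronological`) are hypothetical and not part of any TiFlash tooling. It assumes every record is a JSON object with `time` and `log` fields as in the excerpt, and that a record containing a stack trace may span several physical lines.

```python
import json


def parse_records(raw):
    """Decode JSON-wrapped log records; one record may span several lines."""
    records, buf = [], []
    for line in raw.splitlines():
        buf.append(line)
        try:
            # strict=False accepts raw newlines/tabs inside JSON strings,
            # which is how the multi-line stack traces appear in the paste.
            rec = json.loads("\n".join(buf), strict=False)
        except json.JSONDecodeError:
            continue  # record is not complete yet, keep accumulating lines
        records.append(rec)
        buf = []
    return records


def time_key(rec):
    """Sort key for timestamps like 2024-04-24T08:51:08.00153932Z.

    The fractional part has a varying number of digits, so it is padded
    to nanoseconds instead of comparing the raw strings.
    """
    whole, _, frac = rec["time"].rstrip("Z").partition(".")
    return whole, int(frac.ljust(9, "0"))


def chronological(records):
    """Return the records ordered oldest-first."""
    return sorted(records, key=time_key)
```

To use it, paste the log excerpt into a string and iterate over `chronological(parse_records(text))`, printing `rec["log"]` for each record. A stray non-JSON line would stall the accumulator, so the input should contain only the wrapped records.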
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
2. What did you expect to see? (Required)
no error
3. What did you see instead (Required)
error before crash:
[2023/03/03 04:29:48.332 +08:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 582) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:405] ["Address: NULL pointer."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:413] ["Access: read."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:425] ["Unknown si_code."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:569] ["\n 0x7166471\tfaultSignalHandler(int, siginfo_t, void) [tiflash+118908017]\n \tlibs/libdaemon/src/BaseDaemon.cpp:220\n 0x7fb50e04cd90\t [libc.so.6+347536]\n 0x83a9146\tcq_next(grpc_completion_queue, gpr_timespec, void) [tiflash+138056006]\n \tcontrib/grpc/src/core/lib/surface/completion_queue.cc:999\n 0x1bface9\tDB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue, std::__1::shared_ptr const&) [tiflash+29338857]\n \tdbms/src/Server/FlashGrpcServerHolder.cpp:50\n 0x1bfa8cd\tvoid std::1::thread_proxy<std::1::tuple<std::1::unique_ptr<std::1::thread_struct, std::1::default_delete >, std:: 1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::1::shared_ptr const&)::$_5>(bool, std:: 1::basic_string<char, std::__1::char_traits, std::1::allocator >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std:: 1::shared_ptr const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+29337805]\n \t/usr/local/bin/../include/c++/v1/thread:291\n 0x7fb50e097802\tstart_thread [libc.so.6+653314]"] [source=BaseDaemon] [thread_id=11737]
4. What is your TiFlash version? (Required)
TiFlash
Release Version: v6.7.0-alpha
Edition: Community
Git Commit Hash: fbed3eb9b09691015490ce1fd08254c309d0a1f8
Git Branch: heads/refs/tags/v6.7.0-alpha
UTC Build Time: 2023-02-24 11:34:52
Enable Features: jemalloc sm4(GmSSL) avx2 avx512 unwind thinlto
Profile: RELWITHDEBINFO

Raft Proxy
Git Commit Hash: 9f3377b1dd390e9db141594f94a15064b456b0d4
Git Commit Branch: HEAD
UTC Build Time: 2023-02-24 11:41:15
Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)
Storage Engine: tiflash
Prometheus Prefix: tiflashproxy
Profile: release
Enable Features: Unknown (env var does not exist when building)