pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

tiflash meets crash #6935

Open AkiraXie opened 1 year ago

AkiraXie commented 1 year ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. TiFlash crashes when running the nto and tpcc workloads

2. What did you expect to see? (Required)

no error

3. What did you see instead (Required)

error before crash:

[2023/03/03 04:29:48.332 +08:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 582) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:405] ["Address: NULL pointer."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:413] ["Access: read."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:425] ["Unknown si_code."] [source=BaseDaemon] [thread_id=11737]
[2023/03/03 04:29:48.333 +08:00] [ERROR] [BaseDaemon.cpp:569] ["
       0x7166471  faultSignalHandler(int, siginfo_t*, void*) [tiflash+118908017]
                  libs/libdaemon/src/BaseDaemon.cpp:220
  0x7fb50e04cd90  <unknown symbol> [libc.so.6+347536]
       0x83a9146  cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+138056006]
                  contrib/grpc/src/core/lib/surface/completion_queue.cc:999
       0x1bface9  DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+29338857]
                  dbms/src/Server/FlashGrpcServerHolder.cpp:50
       0x1bfa8cd  void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+29337805]
                  /usr/local/bin/../include/c++/v1/thread:291
  0x7fb50e097802  start_thread [libc.so.6+653314]"] [source=BaseDaemon] [thread_id=11737]

4. What is your TiFlash version? (Required)

TiFlash
Release Version: v6.7.0-alpha
Edition:         Community
Git Commit Hash: fbed3eb9b09691015490ce1fd08254c309d0a1f8
Git Branch:      heads/refs/tags/v6.7.0-alpha
UTC Build Time:  2023-02-24 11:34:52
Enable Features: jemalloc sm4(GmSSL) avx2 avx512 unwind thinlto
Profile:         RELWITHDEBINFO

Raft Proxy
Git Commit Hash:   9f3377b1dd390e9db141594f94a15064b456b0d4
Git Commit Branch: HEAD
UTC Build Time:    2023-02-24 11:41:15
Rust Version:      rustc 1.67.0-nightly (96ddd32c4 2022-11-14)
Storage Engine:    tiflash
Prometheus Prefix: tiflashproxy
Profile:           release
Enable Features:   Unknown (env var does not exist when building)

AkiraXie commented 1 year ago

/severity critical

zanmato1984 commented 1 year ago

@AkiraXie Is TLS enabled?

windtalker commented 1 year ago

This is a crash from grpc-core. We have met some random crashes from grpc-core before, such as https://github.com/pingcap/tiflash/issues/5722. Since the probability of triggering this kind of problem is very low, I would like to change the severity from critical to major.

AkiraXie commented 1 year ago

> @AkiraXie Is TLS enabled?

no

JaySon-Huang commented 7 months ago

Reproduced in an HA testing environment under the disaggregated architecture. A compute node crashed with a similar stack. All write nodes were stopped, so there were no alive write node stores.

{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.002552253Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.494 +08:00] [ERROR] [MPPTask.cpp:647] [\"task running meets error: Poco::Exception. Code: 1000, e.code() = 15, e.displayText() = Exception: no alive tiflash, cannot dispatch BatchCopTask, e.what() = Exception\"] [source=\"MPP<gather_id:<gather_id:1, query_ts:1713948667421748951, local_query_id:2980, server_id:1464686, start_ts:449301359464022031, resource_group: default>,task_id:2>\"] [thread_id=449]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.002358528Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.485 +08:00] [ERROR] [MPPTask.cpp:647] [\"task running meets error: Poco::Exception. Code: 1000, e.code() = 15, e.displayText() = Exception: no alive tiflash, cannot dispatch BatchCopTask, e.what() = Exception\"] [source=\"MPP<gather_id:<gather_id:1, query_ts:1713948667441032954, local_query_id:3362, server_id:1914233, start_ts:449301359477129218, resource_group: default>,task_id:2>\"] [thread_id=141]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001578854Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:563] [\"
       0x7779901 faultSignalHandler(int, siginfo_t*, void*) [tiflash+125278465]
                 libs/libdaemon/src/BaseDaemon.cpp:214
  0x7f3e1ff55630  <unknown symbol> [libpthread.so.0+63024]
       0x93c36e6    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+154941158]
                  contrib/grpc/src/core/lib/surface/completion_queue.cc:999
       0x21258a9   DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+34756777]
                  dbms/src/Server/FlashGrpcServerHolder.cpp:50
       0x212549d    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7&&)::'lambda'(auto&&...)> >(void*) [tiflash+34755741]
                    /usr/local/bin/../include/c++/v1/thread:291
  0x7f3e1ff4dea5   start_thread [libpthread.so.0+32421]\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001575793Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:419] [\"Unknown si_code.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001573143Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:407] [\"Access: read.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001570489Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:399] [\"Address: NULL pointer.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001567602Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:371] [\"(from thread 571) Received signal Segmentation fault(11).\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001564733Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:370] [########################################] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001553575Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.477 +08:00] [ERROR] [BaseDaemon.cpp:563] [\"
       0x7779901 faultSignalHandler(int, siginfo_t*, void*) [tiflash+125278465]
                 libs/libdaemon/src/BaseDaemon.cpp:214
  0x7f3e1ff55630  <unknown symbol> [libpthread.so.0+63024]
       0x93c3ab3    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+154942131]
                  contrib/grpc/src/core/lib/surface/completion_queue.cc:1005
       0x21258a9  DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+34756777]
                 dbms/src/Server/FlashGrpcServerHolder.cpp:50
       0x212549d   void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_7&&)::'lambda'(auto&&...)> >(void*) [tiflash+34755741]
                   /usr/local/bin/../include/c++/v1/thread:291
  0x7f3e1ff4dea5  start_thread [libpthread.so.0+32421]\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001550288Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:419] [\"Unknown si_code.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001547642Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:407] [\"Access: read.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001544919Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:399] [\"Address: NULL pointer.\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.001542085Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:371] [\"(from thread 501) Received signal Segmentation fault(11).\"] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.00153932Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.473 +08:00] [ERROR] [BaseDaemon.cpp:370] [########################################] [source=BaseDaemon] [thread_id=856]\n"}
{"pod":"secondary-tc-tiflash-1","container":"serverlog","time":"2024-04-24T08:51:08.000755732Z","stream":"stdout","namespace":"ha-test-serverless-vector-tps-7571346-1-974","log":"[2024/04/24 16:51:07.437 +08:00] [ERROR] [MPPTask.cpp:647] [\"task running meets error: Poco::Exception. Code: 1000, e.code() = 15, e.displayText() = Exception: no alive tiflash, cannot dispatch BatchCopTask, e.what() = Exception\"] [source=\"MPP<gather_id:<gather_id:1, query_ts:1713948667405980939, local_query_id:3361, server_id:1914233, start_ts:449301359464022018, resource_group: default>,task_id:2>\"] [thread_id=129]\n"}
