pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

TiFlash compute node crashes after injecting a fault such as a network partition #9378

Closed Lily2025 closed 2 weeks ago

Lily2025 commented 3 weeks ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1) Run ch. 2) Inject a network partition on one of the compute nodes (cn).

2. What did you expect to see? (Required)

no crash

3. What did you see instead (Required)

The TiFlash compute node crashed with the following error log:

{"container":"errorlog","stream":"stdout","namespace":"ha-test-disagg-tiflash-tps-7624098-1-212","pod":"secondary-tc-tiflash-0","log":"[2024/08/27 04:27:07.278 +08:00] [ERROR] [BaseDaemon.cpp:560] [\"
 0x563f5e51235e  faultSignalHandler(int, siginfo_t, void) [tiflash+124760926]
     libs/libdaemon/src/BaseDaemon.cpp:211
 0x7f96629e46f0  [libc.so.6+255728]
 0x7f9662a3194c  __pthread_kill_implementation [libc.so.6+571724]
 0x7f96629e4646  __GI_raise [libc.so.6+255558]
 0x7f96629ce7f3  abort [libc.so.6+165875]
 0x7f96629cf130  __libc_message.cold [libc.so.6+168240]
 0x7f96629dd1d7  __libc_assert_fail [libc.so.6+225751]
 0x7f9662a38109  __pthread_tpp_change_priority [libc.so.6+598281]
 0x7f9662a329c5  __pthread_mutex_lock_full [libc.so.6+575941]
 0x7f9667b987c6  std::__1::mutex::lock() [libc++.so.1+587718]
 0x563f5fb45843  std::__1::__function::__func<DB::StorageDisaggregated::buildReadTaskForWriteNodeTable(DB::Context const&, std::__1::shared_ptr const&, DB::DM::DisaggTaskId const&, unsigned long, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::mutex&, std::__1::list<std::__1::shared_ptr, std::__1::allocator<std::__1::shared_ptr>>&)::$_0, std::__1::allocator<DB::StorageDisaggregated::buildReadTaskForWriteNodeTable(DB::Context const&, std::__1::shared_ptr const&, DB::DM::DisaggTaskId const&, unsigned long, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::mutex&, std::__1::list<std::__1::shared_ptr, std::__1::allocator<std::__1::shared_ptr>>&)::$_0>, void ()>::operator()() (.139ff689715caee4ff84ce0b2eee41ae) [tiflash+148039747]
     /usr/local/bin/../include/c++/v1/__mutex/lock_guard.h:35
 0x563f58ef4e65  std::__1::packaged_task<void ()>::operator()() [tiflash+34463333]
     /usr/local/bin/../include/c++/v1/future:1891
 0x563f58ef3119  DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl>::worker(std::__1::__list_iterator<DB::ThreadFromGlobalPoolImpl, void*>) [tiflash+34455833]
     /usr/local/bin/../include/c++/v1/__functional/function.h:517
 0x563f58ef5973  std::__1::__function::__func<DB::ThreadFromGlobalPoolImpl::ThreadFromGlobalPoolImpl<void DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl>::scheduleImpl(std::__1::function<void ()>, long, std::__1::optional, bool)::'lambda0'()>(void&&)::'lambda'(), std::__1::allocator<DB::ThreadFromGlobalPoolImpl::ThreadFromGlobalPoolImpl<void DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl>::scheduleImpl(std::__1::function<void ()>, long, std::__1::optional, bool)::'lambda0'()>(void&&)::'lambda'()>, void ()>::operator()() [tiflash+34466163]
     dbms/src/Common/UniThreadPool.cpp:160
 0x563f58ef4608  void std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete>, void DB::ThreadPoolImpl<std::__1::thread>::scheduleImpl(std::__1::function<void ()>, long, std::__1::optional, bool)::'lambda0'()>>(void) [tiflash+34461192]
     /usr/local/bin/../include/c++/v1/__functional/function.h:517
 0x7f9662a2fc02  start_thread [libc.so.6+564226]
\"] [source=BaseDaemon] [thread_id=145]\n","time":"2024-08-26T20:27:08.239802511Z"}

4. What is your TiFlash version? (Required)

/tiflash/tiflash version
TiFlash
Release Version: v8.4.0-alpha
Edition: Community
Git Commit Hash: 81cd94782a912d3a61e38b7ebc5c2cfabe5694e9
Git Branch: heads/refs/tags/v8.4.0-alpha
UTC Build Time: 2024-08-26 11:38:41
Enable Features: jemalloc sm4(GmSSL) mem-profiling avx2 avx512 unwind thinlto
Profile: RELWITHDEBINFO
Compiler: clang++ 17.0.6

Raft Proxy
Git Commit Hash: f2e5fb8878eb51492c54f1094a847e0b958c6bb8
Git Commit Branch: HEAD
UTC Build Time: ""
Rust Version: rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Storage Engine: tiflash
Prometheus Prefix: tiflashproxy
Profile: release
Enable Features: external-jemalloc portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine openssl-vendored

2024-08-27T03:04:36.672+0800 INFO k8s/client.go:135 it should be noted that a long-running command will not be interrupted even the use case has ended. For more information, please refer to https://github.com/pingcap/test-infra/discussions/129

./br -V
Release Version: v8.4.0-alpha
Git Commit Hash: 4eeeef8a1bbf22c2fd1bdd1e61f303e5d52764e0
Git Branch: heads/refs/tags/v8.4.0-alpha
Go Version: go1.21.10
UTC Build Time: 2024-08-26 11:37:18
Race Enabled: false

Lily2025 commented 3 weeks ago

/assign JinheLin

Lily2025 commented 3 weeks ago

/severity major

JinheLin commented 2 weeks ago

Fixed by https://github.com/pingcap/tiflash/pull/9382