pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

disagg: too many request make tiflash compute node crash #9334

Closed Lily2025 closed 4 weeks ago

Lily2025 commented 1 month ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

(1) Run ch. (2) Inject a network partition on one of the CNs (compute nodes).

2. What did you expect to see? (Required)

no crash

3. What did you see instead (Required)

The TiFlash CN crashes after the network isolation recovers.

{"stream":"stdout","container":"errorlog","pod":"secondary-tc-tiflash-0","namespace":"ha-test-disagg-tiflash-tps-7552417-1-58","time":"2024-08-19T17:34:59.20448412Z","log":"[2024/08/20 01:34:58.361 +08:00] [ERROR] [BaseDaemon.cpp:560] [\"\n 0x55a4c9778b9e\tfaultSignalHandler(int, siginfo_t, void) [tiflash+124169118]\n \tlibs/libdaemon/src/BaseDaemon.cpp:211\n 0x7fb214a5e6f0\t [libc.so.6+255728]\n 0x55a4c95a7d9a\tDB::DM::SegmentReadTask::SegmentReadTask(std::1::shared_ptr const&, DB::Context const&, std::__1::shared_ptr const&, DB::DM::RemotePb::RemoteSegment const&, DB::DM::DisaggTaskId const&, unsigned long, std::1::basic_string<char, std::1::char_traits, std::1::allocator> const&, unsigned int, long) [tiflash+122264986]\n \t/usr/local/bin/../include/c++/v1/memory/shared_ptr.h:884\n 0x55a4cad9eb63\tstd::1::function::func<DB::StorageDisaggregated::buildReadTaskForWriteNodeTable(DB::Context const&, std::1::shared_ptr const&, DB::DM::DisaggTaskId const&, unsigned long, std::__1::basic_string<char, std::1::char_traits, std::1::allocator> const&, std::1::basic_string<char, std::1::char_traits, std::1::allocator> const&, std::1::mutex&, std::1::list<std::1::shared_ptr, std::1::allocator<std::1::shared_ptr>>&)::$_0, std::1::allocator<DB::StorageDisaggregated::buildReadTaskForWriteNodeTable(DB::Context const&, std::1::shared_ptr const&, DB::DM::DisaggTaskId const&, unsigned long, std::__1::basic_string<char, std::1::char_traits, std::1::allocator> const&, std::1::basic_string<char, std::1::char_traits, std::1::allocator> const&, std::1::mutex&, std::1::list<std::1::shared_ptr, std::1::allocator<std::1::shared_ptr>>&)::$_0>, void ()>::operator()() (.139ff689715caee4ff84ce0b2eee41ae) [tiflash+147393379]\n \t/usr/local/bin/../include/c++/v1/memory/construct_at.h:41\n 0x55a4c9a903b5\tauto DB::wrapInvocable<std::1::function<void ()>>(bool, std::1::function<void ()>&&)::'lambda'()::operator()() [tiflash+127411125]\n \t/usr/local/bin/../include/c++/v1/functional/function.h:517\n 
0x55a4c41e60c5\tstd::1::packaged_task<void ()>::operator()() [tiflash+34439365]\n \t/usr/local/bin/../include/c++/v1/future:1891\n 0x55a4c419e4d6\tDB::DynamicThreadPool::executeTask(std::1::unique_ptr<DB::IExecutableTask, std::__1::default_delete>&) [tiflash+34145494]\n \tdbms/src/Common/DynamicThreadPool.cpp:124\n 0x55a4c419e973\tDB::DynamicThreadPool::dynamicWork(std::1::unique_ptr<DB::IExecutableTask, std::1::default_delete>) [tiflash+34146675]\n \tdbms/src/Common/DynamicThreadPool.cpp:148\n 0x55a4c419f3df\tvoid* std::1::thread_proxy[abi:ue170006]<std::1::tuple<std::1::unique_ptr<std::1::thread_struct, std::1::default_delete>, std::1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(std::1::unique_ptr<DB::IExecutableTask, std::1::default_delete>), DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::1::default_delete>>(bool, std::1::basic_string<char, std::__1::char_traits, std::1::allocator>, void (DB::DynamicThreadPool::&&)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete>), DB::DynamicThreadPool&&, std::1::unique_ptr<DB::IExecutableTask, std::__1::default_delete>&&)::'lambda'(auto&&...), DB::DynamicThreadPool*, std::1::unique_ptr<DB::IExecutableTask, std::__1::default_delete>>>(void*) [tiflash+34149343]\n \t/usr/local/bin/../include/c++/v1/__type_traits/invoke.h:308\n 0x7fb214aa9c02\tstart_thread [libc.so.6+564226]\"] [source=BaseDaemon] [thread_id=30184]\n"}

4. What is your TiFlash version? (Required)

/tiflash/tiflash version
TiFlash
Release Version: v8.3.0-alpha
Edition: Community
Git Commit Hash: 14ed7c021b23fffda165ad6a23ea4358001bd54e
Git Branch: heads/refs/tags/v8.3.0-alpha
UTC Build Time: 2024-08-15 11:39:16
Enable Features: jemalloc sm4(GmSSL) mem-profiling avx2 avx512 unwind thinlto
Profile: RELWITHDEBINFO
Compiler: clang++ 17.0.6

Raft Proxy
Git Commit Hash: 4ebe44d321d4c738d89bc145d418b1d6f3464862
Git Commit Branch: HEAD
UTC Build Time: ""
Rust Version: rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Storage Engine: tiflash
Prometheus Prefix: tiflashproxy
Profile: release
Enable Features: external-je

Lily2025 commented 1 month ago

/assign CalvinNeo

Lily2025 commented 1 month ago

/severity critical

JinheLin commented 1 month ago

The reason is that too many threads were created in StorageDisaggregated, resulting in thread creation failure.

std::__1::system_error, e.what() = thread constructor failed: Resource temporarily unavailable.
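
For readers unfamiliar with this failure mode: in standard C++, the `std::thread` constructor throws `std::system_error` when the operating system cannot create another thread, typically with the error condition `std::errc::resource_unavailable_try_again` (EAGAIN), which matches the message above. The sketch below is only an illustration of catching that error at the spawn site. `trySpawn` and `isThreadLimitError` are hypothetical names and are not part of TiFlash.

```cpp
#include <cassert>
#include <functional>
#include <system_error>
#include <thread>
#include <utility>

// Hypothetical helper (not TiFlash code): start a thread and report a
// creation failure through `ec` instead of letting std::system_error escape.
// Precondition: `out` must not already own a running thread.
bool trySpawn(std::function<void()> fn, std::thread & out, std::error_code & ec)
{
    assert(fn);              // a callable must be supplied
    assert(!out.joinable()); // assigning over a joinable thread would terminate
    try
    {
        out = std::thread(std::move(fn));
        ec.clear();
        return true;
    }
    catch (const std::system_error & e)
    {
        // Thread creation failed; keep the OS error for the caller.
        ec = e.code();
        return false;
    }
}

// True when the error means "no resources for another thread" (EAGAIN),
// i.e. "Resource temporarily unavailable".
bool isThreadLimitError(const std::error_code & ec)
{
    return ec == std::errc::resource_unavailable_try_again;
}
```

A caller that gets `false` back can queue the work, retry later, or reject the request, rather than continuing with a task that was never started.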

JaySon-Huang commented 1 month ago

Changed this to an enhancement, because the crash is caused by a large number of requests creating too many threads. We will try to reduce the number of threads created for handling disaggregated requests.
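
One common way to keep the thread count independent of the request count is a fixed-size worker pool fed by a queue. The sketch below is a generic illustration of that idea under stated assumptions, not the change that was made in TiFlash: `BoundedPool` and `runBurst` are hypothetical names, and the queue here is unbounded, so a real system would likely also want backpressure.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool (not TiFlash's DynamicThreadPool): the number
// of threads is set once, and extra requests wait in a queue.
class BoundedPool
{
public:
    explicit BoundedPool(std::size_t workers)
    {
        for (std::size_t i = 0; i < workers; ++i)
            threads.emplace_back([this] { workerLoop(); });
    }

    ~BoundedPool()
    {
        {
            std::lock_guard<std::mutex> lock(mu);
            stopping = true;
        }
        cv.notify_all();
        for (auto & t : threads)
            t.join(); // workers drain the queue before exiting
    }

    BoundedPool(const BoundedPool &) = delete;
    BoundedPool & operator=(const BoundedPool &) = delete;

    // Enqueue a task; no new thread is created here.
    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mu);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mu);
                cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                if (tasks.empty())
                    return; // stopping was requested and nothing is left
                task = std::move(tasks.front());
                tasks.pop();
            }
            task(); // run outside the lock
        }
    }

    std::mutex mu;
    std::condition_variable cv;
    std::queue<std::function<void()>> tasks;
    bool stopping = false;
    std::vector<std::thread> threads;
};

// Usage sketch: run `requests` trivial tasks on `workers` threads and
// return how many completed.
std::size_t runBurst(std::size_t workers, std::size_t requests)
{
    std::atomic<std::size_t> done{0};
    {
        BoundedPool pool(workers);
        for (std::size_t i = 0; i < requests; ++i)
            pool.submit([&done] { ++done; });
    } // destructor drains the queue and joins the workers
    return done.load();
}
```

With this shape, a burst of requests lengthens the queue instead of adding threads, so thread creation can only fail when the pool is constructed, not while requests are being served.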