pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

one of cn was abnormal when all wn crash #8949

Open Lily2025 opened 5 months ago

Lily2025 commented 5 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Run ch.
2. All write nodes (wn) crash because MinIO is full.

2. What did you expect to see? (Required)

All compute nodes (cn) remain normal.

3. What did you see instead (Required)

One of the cn became abnormal when all wn crashed.


{"container":"data0","namespace":"ha-test-serverless-htap-tps-7567184-1-220","log":"[BaseDaemon.cpp:563] [\"\\n       0x77783e1\\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+125273057]\\n                \\tlibs/libdaemon/src/BaseDaemon.cpp:214\\n  0x7fa29e265630\\t<unknown symbol> [libpthread.so.0+63024]\\n  0x7fa29daa4387\\tgsignal [libc.so.6+222087]\\n  0x7fa29daa5a78\\t__GI_abort [libc.so.6+227960]\\n       0x993c731\\tabsl::lts_20211102::raw_logging_internal::RawLog(absl::lts_20211102::LogSeverity, char const*, int, char const*, ...) [tiflash+160679729]\\n                \\tcontrib/abseil-cpp/absl/base/internal/raw_logging.cc:216\\n       0x9a9ed87\\tabsl::lts_20211102::base_internal::LowLevelAlloc::Alloc(unsigned long) [tiflash+162131335]\\n                \\tcontrib/abseil-cpp/absl/base/internal/low_level_alloc.cc:606\\n       0x9933bef\\tabsl::lts_20211102::synchronization_internal::CreateThreadIdentity() [tiflash+160644079]\\n                \\tcontrib/abseil-cpp/absl/synchronization/internal/create_thread_identity.cc:129\\n       0x9930eea\\tabsl::lts_20211102::Mutex::LockSlow(absl::lts_20211102::MuHowS const*, absl::lts_20211102::Condition const*, int) [tiflash+160632554]\\n                \\tcontrib/abseil-cpp/absl/synchronization/mutex.cc:1768\\n       0x9341027\\tpollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+154406951]\\n                \\tcontrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1127\\n       0x93c033b\\tcq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) [tiflash+154927931]\\n                \\tcontrib/grpc/src/core/lib/surface/completion_queue.cc:1294\\n       0x8ea19da\\tgrpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) [tiflash+149559770]\\n                
\\tcontrib/grpc/include/grpcpp/impl/codegen/client_unary_call.h:83\\n       0x97d5c19\\tgrpc::Status grpc::internal::BlockingUnaryCall<pdpb::GetRegionRequest, pdpb::GetRegionResponse, google::protobuf::MessageLite, google::protobuf::MessageLite>(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, pdpb::GetRegionRequest const&, pdpb::GetRegionResponse*) [tiflash+159210521]\\n                \\tcontrib/grpc/include/grpcpp/impl/codegen/client_unary_call.h:52\\n       0x91db48c\\tpingcap::pd::Client::getRegionByKey(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) [tiflash+152941708]\\n                \\tcontrib/client-c/src/pd/Client.cc:364\\n       0x8a701b0\\tpingcap::pd::CodecClient::getRegionByKey(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) [tiflash+145162672]\\n                \\tcontrib/client-c/include/pingcap/pd/CodecClient.h:22\\n       0x91c6452\\tpingcap::kv::RegionCache::locateKey(pingcap::kv::Backoffer&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) [tiflash+152855634]\\n                \\tcontrib/client-c/src/kv/RegionCache.cc:103\\n       0x91ec601\\tpingcap::coprocessor::buildBatchCopTasks(pingcap::kv::Backoffer&, pingcap::kv::Cluster*, bool, bool, std::__1::vector<long, std::__1::allocator<long> > const&, std::__1::vector<std::__1::vector<pingcap::coprocessor::KeyRange, std::__1::allocator<pingcap::coprocessor::KeyRange> >, std::__1::allocator<std::__1::vector<pingcap::coprocessor::KeyRange, std::__1::allocator<pingcap::coprocessor::KeyRange> > > > const&, pingcap::kv::StoreType, bool (* const&)(std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, 
std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&), Poco::Logger*) [tiflash+153011713]\\n                \\tcontrib/client-c/src/coprocessor/Client.cc:367\\n       0x882708d\\tDB::StorageDisaggregated::buildBatchCopTasks(std::__1::vector<std::__1::pair<long, std::__1::vector<pingcap::coprocessor::KeyRange, std::__1::allocator<pingcap::coprocessor::KeyRange> > >, std::__1::allocator<std::__1::pair<long, std::__1::vector<pingcap::coprocessor::KeyRange, std::__1::allocator<pingcap::coprocessor::KeyRange> > > > > const&, bool (* const&)(std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&)) [tiflash+142766221]\\n                \\tdbms/src/Storages/StorageDisaggregated.cpp:172\\n       0x882d144\\tDB::StorageDisaggregated::buildReadTaskWithBackoff(DB::Context const&) [tiflash+142790980]\\n                \\tdbms/src/Storages/StorageDisaggregatedRemote.cpp:147\\n       0x882df76\\tDB::StorageDisaggregated::readThroughS3(DB::PipelineExecutorContext&, DB::PipelineExecGroupBuilder&, DB::Context const&, unsigned int) [tiflash+142794614]\\n                \\tdbms/src/Storages/StorageDisaggregatedRemote.cpp:105\\n       0x89f2a30\\tDB::PhysicalTableScan::buildPipeline(DB::PipelineBuilder&, DB::Context&, DB::PipelineExecutorContext&) [tiflash+144648752]\\n                \\tdbms/src/Flash/Planner/Plans/PhysicalTableScan.cpp:132\\n       
0x896fd62\\tDB::PhysicalPlanNode::buildPipeline(DB::PipelineBuilder&, DB::Context&, DB::PipelineExecutorContext&) [tiflash+144112994]\\n                \\tdbms/src/Flash/Planner/PhysicalPlanNode.cpp:116\\n       0x896fd62\\tDB::PhysicalPlanNode::buildPipeline(DB::PipelineBuilder&, DB::Context&, DB::PipelineExecutorContext&) [tiflash+144112994]\\n                \\tdbms/src/Flash/Planner/PhysicalPlanNode.cpp:116\\n       0x896b349\\tDB::PhysicalPlan::toPipeline(DB::PipelineExecutorContext&, DB::Context&) [tiflash+144094025]\\n                \\tdbms/src/Flash/Planner/PhysicalPlan.cpp:325\\n       0x8924e97\\tDB::PipelineExecutor::PipelineExecutor(std::__1::shared_ptr<MemoryTracker> const&, DB::AutoSpillTrigger*, std::__1::function<void (std::__1::shared_ptr<DB::OperatorSpillContext> const&)> const&, DB::Context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) [tiflash+143806103]\\n                \\tdbms/src/Flash/Executor/PipelineExecutor.cpp:45\\n       0x8782f38\\tDB::(anonymous namespace)::executeAsPipeline(DB::Context&, bool) [tiflash+142094136]\\n                \\tdbms/src/Flash/executeQuery.cpp:199\\n       0x8782614\\tDB::queryExecute(DB::Context&, bool) [tiflash+142091796]\\n                \\tdbms/src/Flash/executeQuery.cpp:239\\n       0x88cabcb\\tDB::MPPTask::runImpl() [tiflash+143436747]\\n                \\tdbms/src/Flash/Mpp/MPPTask.cpp:527\\n       0x1fef248\\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+33485384]\\n                \\tdbms/src/Common/wrapInvocable.h:36\\n       0x1eda2b3\\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&) [tiflash+32350899]\\n                \\tdbms/src/Common/DynamicThreadPool.cpp:124\\n       0x1eda6f6\\tDB::DynamicThreadPool::dynamicWork(std::__1::unique_ptr<DB::IExecutableTask, 
std::__1::default_delete<DB::IExecutableTask> >) [tiflash+32351990]\\n                \\tdbms/src/Common/DynamicThreadPool.cpp:148\\n       0x1edad12\\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >), DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> > >(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, void (DB::DynamicThreadPool::*&&)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >), DB::DynamicThreadPool*&&, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&&)::'lambda'(auto&&...), DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> > > >(void*) [tiflash+32353554]\\n                \\t/usr/local/bin/../include/c++/v1/thread:291\\n  0x7fa29e25dea5\\tstart_thread [libpthread.so.0+32421]\"] [source=BaseDaemon] [thread_id=32480]","pod":"secondary-tc-tiflash-0","level":"ERROR"}
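The backtrace in the log entry above is stored as an escaped string inside the JSON `log` field, which makes the frames hard to read. The following is a minimal sketch for turning such an entry into one frame per line. It assumes the entry is a single JSON object whose `log` field contains literal `\n` and `\t` sequences, and that any physical line breaks were introduced by wrapping when the entry was copied. The function name `pretty_backtrace` is hypothetical and is not part of TiFlash.

```python
import json


def pretty_backtrace(raw: str) -> str:
    """Return the backtrace from a JSON log entry, one element per line.

    Assumption: physical line breaks in `raw` are wrapping artifacts,
    so they are removed before the entry is parsed as JSON.
    """
    joined = "".join(raw.splitlines())
    entry = json.loads(joined)

    # After JSON decoding, the text still holds the two-character
    # sequences backslash-n and backslash-t; expand them here.
    text = entry["log"].replace("\\n", "\n").replace("\\t", "\t")

    # Drop the padding around each frame and skip empty lines.
    frames = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(frames)
```

Passing the full entry to `pretty_backtrace` and printing the result should list each address and symbol on one line, followed by its source location on the next.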

4. What is your TiFlash version? (Required)

/tiflash/tiflash version
 TiFlash
Release Version: v7.1.0-alpha-553-ge86ad8e
Edition:         Community
Git Commit Hash: e86ad8e000690337b205ed8aa1e2afeca38c4f55
Git Branch:      cloud-engine-on-release-7.5
UTC Build Time:  2024-04-12 03:00:05
Enable Features: jemalloc sm4(GmSSL) avx2 avx512 unwind thinlto
Profile:         RELWITHDEBINFO

Raft Proxy
Git Commit Hash:   200fa6be0189a635a56470c0cd2f5dd700e2228f
Git Commit Branch: HEAD
UTC Build Time:    2024-04-12 03:03:50
Rust Version:      rustc 1.67.0-nightly (96ddd32c4 2022-11-14)
Storage Engine:    tiflash
Prometheus Prefix: tiflash_proxy_
Profile:           release
Enable Features:   "raftstore-proxy/external-jemalloc" portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
2024-04-15T05:57:51.515+0800    
Lily2025 commented 5 months ago

/assign JaySon-Huang

JaySon-Huang commented 5 months ago

It seems that after all wn crash, the running "mpp_tasks" on the cn are not canceled correctly.
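To illustrate the behavior one would expect here, the sketch below shows a generic pattern for canceling running tasks when a node they depend on goes down. It is not TiFlash code and makes no claim about how TiFlash tracks MPP tasks: `MppTask`, `TaskRegistry`, and `on_write_node_down` are hypothetical names, and the node-failure notification is assumed to come from some external health check.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MppTask:
    """Hypothetical stand-in for a task running on a compute node."""
    task_id: str
    write_nodes: Set[str]  # write nodes this task reads from
    reason: str = ""
    cancelled: threading.Event = field(default_factory=threading.Event)

    def cancel(self, reason: str) -> None:
        self.reason = reason
        self.cancelled.set()  # workers are expected to poll this flag


class TaskRegistry:
    """Tracks running tasks so they can be canceled on node failure."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tasks: Dict[str, MppTask] = {}

    def register(self, task: MppTask) -> None:
        with self._lock:
            self._tasks[task.task_id] = task

    def on_write_node_down(self, node: str) -> List[str]:
        """Cancel every live task that depends on `node`; return their ids."""
        with self._lock:
            affected = [
                t for t in self._tasks.values()
                if node in t.write_nodes and not t.cancelled.is_set()
            ]
        for task in affected:
            task.cancel(f"write node {node} is down")
        return [t.task_id for t in affected]
```

In this sketch a second notification for the same node cancels nothing, because tasks that are already canceled are skipped.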
