pingcap / tiflash


(serverless)tiflash wn panic due to "Memory limit exceeded" #8778

Closed by mayjiang0203 2 months ago

mayjiang0203 commented 8 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

test case: [htap_bigdata_consistency_004]

2. What did you expect to see? (Required)

no error

3. What did you see instead (Required)

"log":"[2024/02/20 09:12:35.468 +00:00] [ERROR] [BaseDaemon.cpp:416] [\"Address not mapped to object.\"] [source=BaseDaemon] [thread_id=149]\n","namespace":"endless-cse-htap-consistency2-v71-tps-6750005-1-397","stream":"stdout","time":"2024-02-20T09:12:35.560463261Z"}
"log":"[2024/02/20 09:12:35.468 +00:00] [ERROR] [BaseDaemon.cpp:407] [\"Access: read.\"] [source=BaseDaemon] [thread_id=149]\n","namespace":"endless-cse-htap-consistency2-v71-tps-6750005-1-397","stream":"stdout","time":"2024-02-20T09:12:35.560460038Z"}
"log":"[2024/02/20 09:12:35.468 +00:00] [ERROR] [BaseDaemon.cpp:401] [\"Address: 0x40\"] [source=BaseDaemon] [thread_id=149]\n","namespace":"endless-cse-htap-consistency2-v71-tps-6750005-1-397","stream":"stdout","time":"2024-02-20T09:12:35.560457177Z"}
"log":"[2024/02/20 09:12:35.468 +00:00] [ERROR] [BaseDaemon.cpp:371] [\"(from thread 141) Received signal Segmentation fault(11).\"] [source=BaseDaemon] [thread_id=149]\n","namespace":"endless-cse-htap-consistency2-v71-tps-6750005-1-397","stream":"stdout","time":"2024-02-20T09:12:35.560454239Z"}
"log":"[2024/02/20 09:12:35.468 +00:00] [ERROR] [BaseDaemon.cpp:370] [########################################] [source=BaseDaemon] [thread_id=149]\n","namespace":"endless-cse-htap-consistency2-v71-tps-6750005-1-397","stream":"stdout","time":"2024-02-20T09:12:35.560451301Z"}
"log":"[2024/02/20 09:12:35.452 +00:00] [ERROR] [Exception.cpp:91] [\"Code: 0, e.displayText() = DB::Exception: Memory limit exceeded caused by 'RSS(Resident Set Size) much larger than limit' : process memory size would be 26.62 GiB for (attempt to allocate chunk of 2097152 bytes), limit of memory for data computing : 25.60 GiB. Memory Usage of Storage: non-query: peak=8.64 GiB, amount=3.73 GiB; query-storage-task: peak=738.57 MiB, amount=208.75 MiB; fetch-pages: peak=16.51 MiB, amount=0.00 B; shared-column-data: peak=738.57 MiB, amount=210.75 MiB., e.what() = DB::Exception, Stack trace:
       0x1e8f6b1    DB::TiFlashException::TiFlashException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::TiFlashError const&) [tiflash+32044721]
                    dbms/src/Common/TiFlashException.h:263
       0x1e8e63e    MemoryTracker::alloc(long, bool) [tiflash+32040510]
                    dbms/src/Common/MemoryTracker.cpp:214
       0x1e8e21a    MemoryTracker::alloc(long, bool) [tiflash+32039450]
                    dbms/src/Common/MemoryTracker.cpp:225
       0x1e9fc6f    Allocator<false>::realloc(void*, unsigned long, unsigned long, unsigned long) [tiflash+32111727]
                    dbms/src/Common/Allocator.cpp:153
       0x1f0e1bc    void DB::PODArrayBase<1ul, 4096ul, Allocator<false>, 15ul, 16ul>::realloc<>(unsigned long) [tiflash+32563644]
                    dbms/src/Common/PODArray.h:178
       0x7f3c722    DB::ColumnString::insertRangeFrom(DB::IColumn const&, unsigned long, unsigned long) [tiflash+133416738]
                    dbms/src/Columns/ColumnString.cpp:98
       0x76612fb    DB::DM::ColumnFileInMemory::append(DB::DM::DMContext const&, DB::Block const&, unsigned long, unsigned long, unsigned long) [tiflash+124130043]
                    dbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileInMemory.cpp:91
       0x7697238    DB::DM::MemTableSet::appendToCache(DB::DM::DMContext&, DB::Block const&, unsigned long, unsigned long) [tiflash+124351032]
                    dbms/src/Storages/DeltaMerge/Delta/MemTableSet.cpp:192
       0x753facf    DB::DM::Segment::writeToCache(DB::DM::DMContext&, DB::Block const&, unsigned long, unsigned long) [tiflash+122944207]
                    dbms/src/Storages/DeltaMerge/Segment.cpp:522
       0x74d115d    DB::DM::DeltaMergeStore::write(DB::Context const&, DB::Settings const&, DB::Block&) [tiflash+122491229]
                    dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp:613
       0x8a77d88    DB::writeRegionDataToStorage(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, std::__1::shared_ptr<DB::Logger> const&)::$_2::operator()(bool) const [tiflash+145194376]
                    dbms/src/Storages/KVStore/Decode/PartitionStreams.cpp:153
       0x8a741ac    DB::writeRegionDataToStorage(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, std::__1::shared_ptr<DB::Logger> const&) [tiflash+145179052]
                    dbms/src/Storages/KVStore/Decode/PartitionStreams.cpp:199
       0x8a73e16    DB::RegionTable::writeBlockByRegion(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, std::__1::shared_ptr<DB::Logger> const&, bool) [tiflash+145178134]
                    dbms/src/Storages/KVStore/Decode/PartitionStreams.cpp:390
       0x89c90a6    DB::Region::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, DB::TMTContext&) [tiflash+144478374]
                    dbms/src/Storages/KVStore/Region.cpp:913
       0x89b516b    DB::KVStore::handleWriteRaftCmdInner(DB::WriteCmdsView const&, unsigned long, unsigned long, unsigned long, DB::TMTContext&, std::__1::optional<DB::DM::RaftWriteResult>&) [tiflash+144396651]
                    dbms/src/Storages/KVStore/KVStore.cpp:285
       0x89b67b4    DB::KVStore::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) [tiflash+144402356]
                    dbms/src/Storages/KVStore/KVStore.cpp:349
       0x89f7ce5    HandleWriteRaftCmd [tiflash+144669925]
                    dbms/src/Storages/KVStore/FFI/ProxyFFI.cpp:98
  0x7f940c37de38    _$LT$engine_store_ffi..observer..TiFlashObserver$LT$T$C$ER$GT$$u20$as$u20$raftstore..coprocessor..QueryObserver$GT$::post_exec_query::h08204dbfa87355a2 [libtiflash_proxy.so+25849400]
  0x7f940d2080ec    raftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::process_raft_cmd::hbe7730616b6bce99 [libtiflash_proxy.so+41095404]
  0x7f940d20f1b0    raftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::handle_raft_committed_entries::h2b43a141f5adb68b [libtiflash_proxy.so+41124272]
  0x7f940d1e901c    raftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_apply::hca949330c5de182e [libtiflash_proxy.so+40968220]
  0x7f940d1ed5a2    raftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_tasks::he9b741f790fff6af [libtiflash_proxy.so+40986018]
  0x7f940c46503e    _$LT$raftstore..store..fsm..apply..ApplyPoller$LT$EK$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..apply..ApplyFsm$LT$EK$GT$$C$raftstore..store..fsm..apply..ControlFsm$GT$$GT$::handle_normal::hb3eef3de7f9b9647 [libtiflash_proxy.so+26796094]
  0x7f940c3e23f3    batch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::ha5b8a09338f8b985 [libtiflash_proxy.so+26260467]
  0x7f940c4d0522    std::sys_common::backtrace::__rust_begin_short_backtrace::h6b6bac765e41e6cf [libtiflash_proxy.so+27235618]
  0x7f940c51c1fe    core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::ha18d3e96d248a746 [libtiflash_proxy.so+27546110]
  0x7f940d9c1295    std::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+49193621]
                    /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108
  0x7f940a3b8ea5    start_thread [libpthread.so.0+32421]
  0x7f9409cc796d    __clone [libc.so.6+1042797]\"] [source=\"DB::EngineStoreApplyRes DB::HandleWriteRaftCmd(const DB::EngineStoreServerWrap *, DB::WriteCmdsView, DB::RaftCmdHeader)\"] [thread_id=91]\n","namespace":"endless-cse-htap-consistency2-v71-tps-6750005-1-397","stream":"stdout","time":"2024-02-20T09:12:35.560418429Z"}
"log":"[2024/02/20 09:12:35.165 +00:00] [ERROR] [Region.cpp:937] [\"[region_id=2419 applied_term=6 applied_index=67618] catch exception: Memory limit exceeded caused by 'RSS(Resident Set Size) much larger than limit' : process memory size would be 26.62 GiB for (attempt to allocate chunk of 2097152 bytes), limit of memory for data computing : 25.60 GiB. Memory Usage of Storage: non-query: peak=8.64 GiB, amount=3.73 GiB; query-storage-task: peak=738.57 MiB, amount=208.75 MiB; fetch-pages: peak=16.51 MiB, amount=0.00 B; shared-column-data: peak=738.57 MiB, amount=210.75 MiB., while applying `RegionTable::writeBlockByRegion` on [term 6, index 67619], entries PUT|write|7800000174800000FF00000000EF5F7280FF0000000017A8B700FEF9C8E8D8EB4BFFF3:DEL|lock|

4. What is your TiFlash version? (Required)

v7.1.0

mayjiang0203 commented 8 months ago

/severity critical
/assign @JinheLin

JaySon-Huang commented 8 months ago

The problem is that the test environment for the write node (wn) under the disaggregated architecture has only 32 GiB of memory, which is not sufficient. If the memory-limit exception happens to be thrown in a raft thread, it makes TiFlash crash.
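To make the point above concrete, here is a minimal sketch of why the same memory-limit exception is harmless on a query thread but fatal on a raft apply thread. Everything in it is hypothetical: `SketchMemoryTracker`, `runQuery`, `applyRaftWrite`, and `limitForNode` are illustrative names, not TiFlash's `MemoryTracker` or `HandleWriteRaftCmd`. The 80% ratio is an assumption chosen because it reproduces the 25.60 GiB limit in the log from a 32 GiB node; the crash is modeled as a return value rather than an actual process abort.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

constexpr std::int64_t GiB = 1LL << 30;

// Assumed ratio: 4/5 of node memory, which matches 25.60 GiB for a 32 GiB node.
constexpr std::int64_t limitForNode(std::int64_t total_bytes)
{
    return total_bytes * 4 / 5;
}

// Hypothetical, simplified stand-in for a process-wide memory tracker.
class SketchMemoryTracker
{
public:
    explicit SketchMemoryTracker(std::int64_t limit_bytes) : limit_(limit_bytes) {}

    // Record an allocation; throw if it would push usage past the limit.
    void alloc(std::int64_t size)
    {
        if (amount_ + size > limit_)
            throw std::runtime_error(
                "Memory limit exceeded: would use " + std::to_string(amount_ + size)
                + " bytes, limit " + std::to_string(limit_));
        amount_ += size;
    }

    void free(std::int64_t size)
    {
        assert(size <= amount_);
        amount_ -= size;
    }

    std::int64_t current() const { return amount_; }

private:
    std::int64_t limit_;
    std::int64_t amount_ = 0;
};

enum class Outcome { Ok, QueryFailed, ProcessWouldAbort };

// A query that hits the limit can simply fail; the client may retry it.
inline Outcome runQuery(SketchMemoryTracker & tracker, std::int64_t bytes)
{
    try
    {
        tracker.alloc(bytes);
        tracker.free(bytes);
        return Outcome::Ok;
    }
    catch (const std::runtime_error &)
    {
        return Outcome::QueryFailed;
    }
}

// A committed raft entry cannot be skipped, so a failed apply has no safe
// fallback. The sketch reports this instead of terminating the process.
inline Outcome applyRaftWrite(SketchMemoryTracker & tracker, std::int64_t bytes)
{
    try
    {
        tracker.alloc(bytes); // memory stays in use, as with an in-memory delta cache
        return Outcome::Ok;
    }
    catch (const std::runtime_error &)
    {
        return Outcome::ProcessWouldAbort;
    }
}
```

With a tracker built from `limitForNode(32 * GiB)`, an oversized `runQuery` call returns `Outcome::QueryFailed` and leaves usage unchanged, while an oversized `applyRaftWrite` call returns `Outcome::ProcessWouldAbort`. That asymmetry is why memory pressure on an undersized write node shows up as a crash rather than as failed queries.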

JaySon-Huang commented 2 months ago

This appears to be an issue that occurs only on the cse-proxy branch (https://github.com/pingcap/tidb-engine-ext is not affected). Too many raft logs are added to in-memory WriteBatches while pending deletion. It should be fixed by https://github.com/tidbcloud/cloud-storage-engine/pull/1757
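As an illustration of the failure mode described above, the following sketch shows one generic way to stop a batch of pending deletions from growing without bound: flush it once it reaches a size threshold. This is not the change made in the linked pull request, whose contents are not shown in this thread. `BoundedDeleteBatch` and all of its members are hypothetical names, and the "flush" only counts calls instead of writing to a storage engine.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical batch of raft-log keys waiting to be deleted.
// It flushes itself once the buffered keys reach max_bytes,
// so in-memory usage stays bounded regardless of how many keys arrive.
class BoundedDeleteBatch
{
public:
    explicit BoundedDeleteBatch(std::size_t max_bytes) : max_bytes_(max_bytes) {}

    // Queue a key for deletion; flush when the threshold is reached.
    void addDelete(std::string key)
    {
        pending_bytes_ += key.size();
        pending_.push_back(std::move(key));
        if (pending_bytes_ >= max_bytes_)
            flush();
    }

    // A real engine would hand the batch to storage here; the sketch only
    // counts the flush and releases the buffered keys.
    void flush()
    {
        if (pending_.empty())
            return;
        ++flush_count_;
        pending_.clear();
        pending_bytes_ = 0;
    }

    std::size_t pendingBytes() const { return pending_bytes_; }
    std::size_t pendingCount() const { return pending_.size(); }
    std::size_t flushCount() const { return flush_count_; }

private:
    std::size_t max_bytes_;
    std::size_t pending_bytes_ = 0;
    std::size_t flush_count_ = 0;
    std::vector<std::string> pending_;
};
```

The threshold trades memory for write amplification: a smaller `max_bytes` caps memory more tightly but causes more, smaller flushes.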