Open Lily2025 opened 1 year ago
/type bug /severity major /assign ethercflow
Another OOM with this scenario: inject an I/O hang on tikv-0 for 20 minutes; tikv-0 OOMs when the fault recovers.
Besides the store channel, every region has its own channel. In this case, messages may pile up in these channels, and the total memory usage is about **O(len(regions) * cap(channel))**.
This suggests that TiKV may need a global memory quota for channel memory usage. If the quota is full, TiKV should reject new messages.
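A minimal sketch of the global-quota idea, assuming a hypothetical `MemoryQuota` type shared by all per-region channels (this is not TiKV's actual API): senders reserve bytes before enqueueing a message and the receiver releases them after consuming it, so a reservation fails, and the message can be rejected, once the global budget is exhausted.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical sketch, not TiKV's real implementation: a single quota
// shared by every per-region channel.
struct MemoryQuota {
    used: AtomicUsize,
    capacity: usize,
}

impl MemoryQuota {
    fn new(capacity: usize) -> Self {
        MemoryQuota { used: AtomicUsize::new(0), capacity }
    }

    // Try to reserve `bytes`; fail instead of blocking, so the gRPC
    // side can reject the raft message rather than buffer it.
    fn try_reserve(&self, bytes: usize) -> bool {
        let mut cur = self.used.load(Ordering::Relaxed);
        loop {
            if cur + bytes > self.capacity {
                return false;
            }
            match self.used.compare_exchange_weak(
                cur, cur + bytes, Ordering::AcqRel, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }

    // Called by the consumer (e.g. the store thread) after a message
    // is taken off its channel.
    fn release(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::AcqRel);
    }
}

fn main() {
    let quota = Arc::new(MemoryQuota::new(1024));
    assert!(quota.try_reserve(600));  // first message fits
    assert!(!quota.try_reserve(600)); // quota full: reject, don't pile up
    quota.release(600);               // consumer drained the message
    assert!(quota.try_reserve(600));  // budget is available again
}
```

The key point of the sketch is that the quota is global, so the bound no longer scales with the number of regions the way per-channel capacities do.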
Another OOM scenario: inject a network partition on one TiKV node lasting 50 minutes, then recover; this TiKV OOMs after the fault recovers.
Another OOM scenario: inject an I/O delay of 500ms or 100ms on one TiKV node for 10 minutes; this TiKV OOMs when the fault recovers. https://clinic.pingcap.com.cn/portal/#/orgs/31/clusters/7297268250385940865?from=1699059825&to=1699061064
Bug Report
What version of TiKV are you using?
```
./tikv-server -V
TiKV Release Version: 6.4.0-alpha
Edition: Community
Git Commit Hash: 97ab36eb7147cde02c1654595f99104155ac0c21
Git Commit Branch: heads/refs/tags/v6.4.0-alpha
UTC Build Time: 2022-11-02 11:01:58
Rust Version: rustc 1.64.0-nightly (0f4bcadb4 2022-07-30)
Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
Profile: dist_release
```
What operating system and CPU are you using?
8 cores, 32 GB RAM
Steps to reproduce
ha-tikv-random-data-io-hang-last-for-20m (https://tcms.pingcap.net/dashboard/executions/case/2788098)
What did you expect?
No OOM.
What happened?
TiKV OOM.
from ethercflow: The initial cause appears to be a backlog of received raft messages. Specifically, this node keeps receiving messages from the leader. After gRPC receives a message, it sends it to the store thread through a channel. However, the store thread cannot consume from the channel because of the I/O hang, which defeats the earlier logic by which gRPC rejects excessive raft messages to prevent OOM.
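The failure mode above can be illustrated with a small simulation, assuming made-up numbers (1000 regions, per-channel capacity 64, 1 KiB per message) purely for demonstration: one bounded channel per region, and a consumer that never drains them, standing in for the store thread stalled by the I/O hang. Every channel fills to capacity, so the total buffered memory grows as O(len(regions) * cap(channel)).

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Illustrative parameters only, not TiKV's actual configuration.
    const REGIONS: usize = 1000;
    const CAP: usize = 64;          // per-region channel capacity
    const MSG_BYTES: usize = 1024;  // pretend each raft message is 1 KiB

    let mut buffered = 0usize;
    let mut _receivers = Vec::new(); // keep receivers alive, never consume

    for _ in 0..REGIONS {
        let (tx, rx) = sync_channel::<Vec<u8>>(CAP);
        _receivers.push(rx);
        // The "gRPC side" keeps sending until the channel is full,
        // because nothing tracks the aggregate memory across channels.
        loop {
            match tx.try_send(vec![0u8; MSG_BYTES]) {
                Ok(()) => buffered += MSG_BYTES,
                Err(TrySendError::Full(_)) => break,
                Err(TrySendError::Disconnected(_)) => break,
            }
        }
    }

    // 1000 regions * 64 messages * 1 KiB = 64 MiB pinned in channels
    // while the consumer is stalled.
    println!("buffered bytes: {}", buffered);
    assert_eq!(buffered, REGIONS * CAP * MSG_BYTES);
}
```

Each channel individually respects its capacity, yet the aggregate still scales with the region count, which is why a per-channel bound alone cannot prevent the OOM.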