tikv / tikv

Distributed transactional key-value database, originally created to complement TiDB
https://tikv.org
Apache License 2.0
15.19k stars, 2.14k forks

TiKV OOM after recovery from an I/O delay/hang on that TiKV, or from a 50-minute network partition of one TiKV #13731

Open Lily2025 opened 1 year ago

Lily2025 commented 1 year ago

Bug Report

What version of TiKV are you using?

./tikv-server -V
TiKV
Release Version: 6.4.0-alpha
Edition: Community
Git Commit Hash: 97ab36eb7147cde02c1654595f99104155ac0c21
Git Commit Branch: heads/refs/tags/v6.4.0-alpha
UTC Build Time: 2022-11-02 11:01:58
Rust Version: rustc 1.64.0-nightly (0f4bcadb4 2022-07-30)
Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
Profile: dist_release

What operating system and CPU are you using?

8 cores, 32 GB memory

Steps to reproduce

ha-tikv-random-data-io-hang-last-for-20m (https://tcms.pingcap.net/dashboard/executions/case/2788098)

What did you expect?

No OOM.

What happened?

TiKV OOM.


From ethercflow: The initial cause seems to be a backlog of received raft messages. Specifically, this node keeps receiving messages from the leader. After gRPC receives a message, it sends it to the store thread through a channel. However, the store thread cannot consume from the channel because of the I/O hang, so the existing logic in which gRPC rejects excess raft messages to prevent OOM appears to be ineffective.
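To make the intended behavior concrete, here is a minimal sketch of a receiver that rejects raft messages once the channel to a stalled consumer is full. It uses only the Rust standard library, and every name in it (`RaftMsg`, `Forward`, `forward`, `simulate_stalled_store`) is hypothetical; it is not TiKV's actual code and does not explain why the real rejection logic failed here.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for a raft message payload.
pub type RaftMsg = Vec<u8>;

/// Outcome of handing one message to the (hypothetical) store thread.
#[derive(Debug, PartialEq, Eq)]
pub enum Forward {
    Queued,
    RejectedFull,
    StoreGone,
}

/// Non-blocking hand-off: if the bounded channel is full, reject the
/// message instead of buffering it without limit.
pub fn forward(tx: &SyncSender<RaftMsg>, msg: RaftMsg) -> Forward {
    match tx.try_send(msg) {
        Ok(()) => Forward::Queued,
        Err(TrySendError::Full(_)) => Forward::RejectedFull,
        Err(TrySendError::Disconnected(_)) => Forward::StoreGone,
    }
}

/// Simulates a store thread stalled by an I/O hang: the receiver exists
/// but never consumes. Returns (queued, rejected) message counts.
pub fn simulate_stalled_store(capacity: usize, incoming: usize) -> (usize, usize) {
    // `_rx` is kept alive for the whole function so the channel stays connected.
    let (tx, _rx): (SyncSender<RaftMsg>, Receiver<RaftMsg>) = sync_channel(capacity);
    let mut queued = 0;
    let mut rejected = 0;
    for _ in 0..incoming {
        match forward(&tx, vec![0u8; 1024]) {
            Forward::Queued => queued += 1,
            Forward::RejectedFull => rejected += 1,
            Forward::StoreGone => break,
        }
    }
    (queued, rejected)
}
```

With a capacity of 4 and 10 incoming messages, `simulate_stalled_store(4, 10)` queues 4 and rejects 6, so memory held by this one channel stays bounded by its capacity while the consumer is stalled.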

Lily2025 commented 1 year ago

/type bug /severity major /assign ethercflow

Lily2025 commented 1 year ago

Another OOM in this scenario: inject an I/O hang on tikv-0 for 20 minutes; tikv-0 OOMs when the fault recovers.


overvenus commented 1 year ago

Besides the store channel, every region has its own channel. In this case, messages may pile up in each of those channels, so the total memory usage is about O(len(regions) * cap(channel)).

This suggests that TiKV may need a global memory quota for channel memory usage. If memory quota is full, TiKV should reject new messages.
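A global quota of this kind could be sketched as a shared atomic byte counter that every channel charges before enqueueing a message and releases after the message is consumed. The sketch below is an illustration under that assumption, not a description of how TiKV implements or would implement it; the type `MemoryQuota` and its methods are hypothetical names defined here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical process-wide quota shared by all message channels.
pub struct MemoryQuota {
    capacity: usize,
    in_use: AtomicUsize,
}

impl MemoryQuota {
    pub fn new(capacity: usize) -> Self {
        MemoryQuota { capacity, in_use: AtomicUsize::new(0) }
    }

    /// Tries to reserve `bytes`. Returns false, without reserving anything,
    /// if the reservation would exceed the quota; the caller should then
    /// reject the incoming message.
    pub fn try_alloc(&self, bytes: usize) -> bool {
        let mut cur = self.in_use.load(Ordering::Relaxed);
        loop {
            let next = match cur.checked_add(bytes) {
                Some(n) if n <= self.capacity => n,
                _ => return false,
            };
            // Retry if another thread changed the counter in the meantime.
            match self.in_use.compare_exchange_weak(
                cur,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }

    /// Releases a previous reservation. The caller must pass exactly the
    /// size it reserved, otherwise the counter drifts.
    pub fn free(&self, bytes: usize) {
        self.in_use.fetch_sub(bytes, Ordering::AcqRel);
    }

    pub fn in_use(&self) -> usize {
        self.in_use.load(Ordering::Relaxed)
    }
}
```

For example, with `MemoryQuota::new(100)`, a 60-byte reservation succeeds, a following 50-byte reservation is refused, and after `free(60)` a 100-byte reservation succeeds. Because the counter is shared, the bound holds regardless of how many regions exist, unlike a per-channel capacity.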

Lily2025 commented 1 year ago

Another OOM scenario: inject a network partition on one TiKV for 50 minutes, then recover; this TiKV OOMs after the fault recovers.

Lily2025 commented 11 months ago

Another OOM scenario: inject an I/O delay of 500 ms or 100 ms on one TiKV for 10 minutes; this TiKV OOMs when the fault recovers. https://clinic.pingcap.com.cn/portal/#/orgs/31/clusters/7297268250385940865?from=1699059825&to=1699061064