iotaledger / iota-core

OOM Crash - failed to store on BlockDropped in retainer #934

Closed by shufps 5 months ago

shufps commented 5 months ago

We have two nodes that crashed after running out of memory.

It seems they started to log this error message:

Protocol.Engine0        engine error (err=blockRetainer: failed to store on BlockDropped in retainer: cannot update block metadata for block BlockID(0xbc718142f4c3957f2e7484dec30b891a9edfc09b2d50c8faa8d753d09bb8dc12d4830000:33748) with state dropped as block is already committed)

About 50k times per hour.

Memory usage inflated at the same time: (image)

We have a log file from when it started: faucet.h.iota2-alphanet_2024-04-24-09.log

Unfortunately it happened at night, so we have no memory profile of this node.

But we do have a profile of another node that started logging at the same time and "recovered" later on (although its memory usage is still high): (image)

pprof.validator-2_20240425-075134_all.zip

Maybe it shows something :see_no_evil:

alexsporn commented 5 months ago

Same underlying deadlock in the DRR scheduler as in #936
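
As a generic illustration of this failure mode (not the iota-core code, and not a claim about the exact cause in #936), the sketch below shows that readers calling `RLock` on a `sync.RWMutex` cannot make progress while the write lock is held. The `scheduler` type and `readyBlocksCount` method are hypothetical stand-ins. In a real deadlock the write lock would never be released, so every caller would stay parked and keep its stack and captured data alive.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// scheduler is a hypothetical stand-in, not the iota-core type.
type scheduler struct {
	mu    sync.RWMutex
	ready int
}

// readyBlocksCount takes the read lock, like a typical getter would.
func (s *scheduler) readyBlocksCount() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.ready
}

// blockedReaders holds the write lock, starts n readers, and reports how
// many readers finished while the lock was held and after it was released.
func blockedReaders(n int) (whileLocked, afterUnlock int64) {
	s := &scheduler{}
	var wg sync.WaitGroup
	var done int64

	s.mu.Lock() // simulate a writer that does not release the lock
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = s.readyBlocksCount() // parks in RLock until Unlock
			atomic.AddInt64(&done, 1)
		}()
	}
	whileLocked = atomic.LoadInt64(&done) // no reader can have finished yet

	s.mu.Unlock() // in a real deadlock this line is never reached
	wg.Wait()
	afterUnlock = atomic.LoadInt64(&done)
	return whileLocked, afterUnlock
}

func main() {
	locked, unlocked := blockedReaders(5)
	fmt.Println("finished while locked:", locked, "finished after unlock:", unlocked)
}
```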

alexsporn commented 5 months ago

goroutine 8456281 [sync.RWMutex.RLock, 1150 minutes]:
sync.runtime_SemacquireRWMutexR(0xc00048bb08?, 0xa0?, 0xc0004ef560?)
    /usr/local/go/src/runtime/sema.go:82 +0x25
sync.(*RWMutex).RLock(...)
    /usr/local/go/src/sync/rwmutex.go:70
github.com/iotaledger/iota-core/pkg/protocol/engine/congestioncontrol/scheduler/drr.(*Scheduler).ReadyBlocksCount(0xc000346fa0)