hashgraph / hedera-services

Crypto, token, consensus, file, and smart contract services for the Hedera public ledger
Apache License 2.0

Java heap OOM in the teacher during reconnect #11086

Open OlegMazurov opened 6 months ago

OlegMazurov commented 6 months ago

Description

The reconnect connection breaks because of a problem on the learner side. After a while, the teacher dies with a Java heap OOM.

Steps to reproduce

The issue was observed in a reconnect testing framework running in single-node mode. Further investigation is needed to determine whether it can also affect networks.

Additional context

The OOM is caused by an FCQueue for expirable transaction records that is never released. See also #11364.
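
To illustrate the kind of leak described here, below is a minimal sketch of releasing a reserved structure on every exit path of a teaching attempt. It is not the platform's code: `ReservedRecords`, `TeacherReleaseSketch`, and `teach` are hypothetical names, and the reference counter only stands in for whatever reservation mechanism the real FCQueue copy uses. The assumption is that the teacher holds a reservation for the duration of the reconnect and must drop it even when the connection breaks.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a reference-counted structure held by the teacher. */
final class ReservedRecords {
    private final AtomicInteger reservations = new AtomicInteger(1);
    private Queue<String> records = new ArrayDeque<>();

    void add(String record) {
        records.add(record);
    }

    /** Drops one reservation; the backing queue becomes collectable at zero. */
    void release() {
        if (reservations.decrementAndGet() == 0) {
            records = null;
        }
    }

    boolean isReleased() {
        return reservations.get() == 0;
    }
}

public class TeacherReleaseSketch {
    /**
     * Simulates one teaching attempt (hypothetical). The finally block runs
     * whether the attempt succeeds or the connection breaks, so the
     * reservation is never left behind.
     */
    static boolean teach(ReservedRecords state, boolean connectionBreaks) {
        try {
            if (connectionBreaks) {
                throw new IllegalStateException("learner closed the connection");
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            state.release();
        }
    }

    public static void main(String[] args) {
        ReservedRecords state = new ReservedRecords();
        state.add("record-1");
        boolean taught = teach(state, true);
        System.out.println("taught=" + taught + " released=" + state.isReleased());
    }
}
```

If the release lived only on the success path, a broken connection would leave the reservation in place and the queue would stay reachable, which matches the symptom reported above.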

Hedera network

other

Version

v0.47.0-SNAPSHOT

Operating system

Linux

OlegMazurov commented 6 months ago

Two approaches:

OlegMazurov commented 3 weeks ago

An OOM with these symptoms was observed in a performance network when testing v0.51.2. Relevant log messages from the teacher node:

2024-06-22 00:12:40.462 292643   INFO  RECONNECT        <<platform-core: SyncProtocolWith1 3 to 1>> ReconnectTeacher: Starting reconnect in the role of the sender {"receiving":false,"nodeId":3,"otherNodeId":1,"round":302201} [com.swirlds.logging.legacy.payload.ReconnectStartPayload]
...
2024-06-22 01:03:38.788 301244   INFO  RECONNECT        <<platform-core: SyncProtocolWith1 3 to 1>> TeachingSynchronizer: sending tree rooted at com.swirlds.virtualmap.internal.merkle.VirtualRootNode with route [0 -> 32 -> 1]   -- last RECONNECT log message
...
2024-06-22 01:52:55.573 308662   ERROR EXCEPTION        <platformForkJoinThread-9> PlatformBuilder: Uncaught exception on thread Thread[#250,platformForkJoinThread-9,5,platform]: java.lang.OutOfMemoryError: Java heap space