Current behavior
Over the weekend, seed.koinos.io got stuck. Why it got stuck is unclear, but why it could not recover is clear.
The p2p node had a few other peers that, presumably, were also stuck. The node would reconnect to a peer that was ahead of it, but because the majority of its connected peers were also stuck, gossip remained enabled. The node would then get spammed with pending transactions, all of which failed because the mempool was full and, according to the mempool, those addresses had no mana remaining. This caused the node to disconnect before it could receive even a single sync block. The pattern then repeated, and the node kept disconnecting from all of its peers.
Although this is a fairly niche issue, the node recovered quickly once the p2p and mempool microservices were restarted. The node was then able to connect to enough peers to disable gossip and begin syncing again.
Expected behavior
The node should be able to recover on its own; a failure in the node should not leave it in an unrecoverable state.
Possible solutions are:
Expire old transactions in the mempool by wall clock time.
Turn off p2p gossip when the head block is older than a certain age.
Steps to reproduce
No response
Environment
Anything else?
No response