stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

Fix/5193 stackerdb decoherence #5197

Closed jcnelson closed 5 days ago

jcnelson commented 1 week ago

This fixes #5193 by having all p2p state machines (namely, the epoch 2.x and Nakamoto inv sync state machines and the StackerDB sync state machine) track and report their pinned connections to the peer network so that they won't be pruned. The cause of the decoherence seems to have been that once a peer's outbound neighbor count exceeded [connection_opts].soft_max_neighbors_per_org or one of the other similar limits, the pruner would simply close the newer connections until the number of connections was back under the limit. This would often happen during StackerDB sync (and would also happen during inv sync), with the effect that a node with many neighbors would fail to synchronize its StackerDB replicas.
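To illustrate the idea, here is a minimal sketch of a pruner that honors pinned connections. It is not the code in this PR: `Conn`, `select_prunable`, the `event_id` and `age_secs` fields, and the single `soft_max` limit are all hypothetical names chosen for the example, and the real pruner applies several limits rather than one.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a p2p connection (not a stacks-core type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Conn {
    /// Assumed unique identifier for the connection.
    pub event_id: usize,
    /// Seconds since the connection was established.
    pub age_secs: u64,
}

/// Pick the connections to close so that at most `soft_max` remain.
/// The youngest connections are chosen first, mirroring the behavior
/// described above, but anything listed in `pinned` is never chosen.
pub fn select_prunable(
    conns: &[Conn],
    pinned: &HashSet<usize>,
    soft_max: usize,
) -> Vec<usize> {
    // How many connections we are over the soft limit (0 if under).
    let excess = conns.len().saturating_sub(soft_max);

    // Only unpinned connections are eligible for pruning.
    let mut candidates: Vec<&Conn> = conns
        .iter()
        .filter(|c| !pinned.contains(&c.event_id))
        .collect();

    // Youngest first.
    candidates.sort_by_key(|c| c.age_secs);

    candidates
        .into_iter()
        .take(excess)
        .map(|c| c.event_id)
        .collect()
}
```

With an empty `pinned` set, the function picks the newest connections, which is what would interrupt an in-progress StackerDB or inv sync. Once a state machine reports the connections it is using, those are skipped, and if everything over the limit is pinned the node simply stays over the soft limit for a while.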

I suspect this was also the cause of the decoherence we would see with larger Nakamoto testnets, where the soft limits on the number of neighbors were exceeded.

You can see the effect of this PR in /v2/neighbors -- inbound and outbound peer entries now report an age (in seconds), which should rarely be reset due to the pinning. Before, neighbors would come and go very quickly as state machines connected to them and the pruner immediately disconnected them.
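One way to check this from the outside is to poll /v2/neighbors twice and compare the reported ages. The sketch below assumes the ages have already been extracted into a map from peer address to age in seconds. That map layout and the name `churned_peers` are assumptions made for the example, not the endpoint's actual response format.

```rust
use std::collections::HashMap;

/// Given two snapshots of (peer address -> connection age in seconds),
/// return the peers whose connection disappeared or whose age went
/// backwards, meaning it was dropped and re-established between polls.
pub fn churned_peers(
    before: &HashMap<String, u64>,
    after: &HashMap<String, u64>,
) -> Vec<String> {
    let mut out: Vec<String> = before
        .iter()
        .filter(|(peer, old_age)| match after.get(*peer) {
            // Still connected: churned only if the age was reset.
            Some(new_age) => new_age < *old_age,
            // No longer present: the connection was closed.
            None => true,
        })
        .map(|(peer, _)| peer.clone())
        .collect();
    // Sort for deterministic output.
    out.sort();
    out
}
```

With pinning in place, this list should stay mostly empty across polls. Before the change, it would be expected to contain many peers on each poll.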

Leaving as a draft for now so I can test this live with the Nakamoto testnet signers.

jcnelson commented 1 week ago

I am testing this on mainnet along with my other in-flight PRs, and I think I'm getting OOM'ed. I need to confirm first.

wileyj commented 1 week ago

> I am testing this on mainnet along with my other in-flight PRs, and I think I'm getting OOM'ed. I need to confirm first.

Will also run this branch to see if I can reproduce.