Closed jcnelson closed 1 month ago
I am testing this on mainnet along with my other in-flight PRs, and I think I'm getting OOM'ed. I need to confirm first.
Will also run this branch to see if I can reproduce.
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This fixes #5193 by having all p2p state machines (namely, both the epoch 2.x and Nakamoto inv sync state machines, as well as StackerDB sync) track and report their pinned connections to the peer network, so they won't be pruned. The cause of the decoherence seems to have been that once a peer's outbound neighbor count exceeded `[connection_opts].soft_max_neighbors_per_org` or one of the other similar limits, the pruner would simply close the newest connections until the count was brought back down. This would often happen during StackerDB sync (and would also happen in inv sync), which meant that a node with many neighbors would fail to synchronize its StackerDB replicas. I suspect this was also the cause of the decoherence we would see with larger Nakamoto testnets, where the soft limits on the number of neighbors were exceeded.
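For a concrete picture of the pinning idea, here is a minimal sketch in Rust; the types and names (`PeerNetwork`, `pin_connection`, `prune_outbound`, `soft_max_outbound`) are illustrative stand-ins and not the actual stacks-core API:

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of the peer network's prune pass.
/// Connections pinned by a state machine (inv sync, StackerDB sync)
/// are skipped, so the pruner only closes unpinned excess connections.
struct PeerNetwork {
    /// Event IDs of currently open outbound connections, newest last.
    outbound: Vec<usize>,
    /// Event IDs pinned by in-flight state machines.
    pinned: HashSet<usize>,
    /// Soft cap analogous to [connection_opts].soft_max_neighbors_per_org.
    soft_max_outbound: usize,
}

impl PeerNetwork {
    /// Register a connection a state machine depends on so it survives pruning.
    fn pin_connection(&mut self, event_id: usize) {
        self.pinned.insert(event_id);
    }

    /// Release the pin once the state machine is done with the connection.
    fn unpin_connection(&mut self, event_id: usize) {
        self.pinned.remove(&event_id);
    }

    /// Close the newest unpinned connections until we are at or below the
    /// soft cap. Returns the event IDs that were closed.
    fn prune_outbound(&mut self) -> Vec<usize> {
        let mut closed = Vec::new();
        while self.outbound.len() > self.soft_max_outbound {
            // Find the newest connection that is not pinned.
            match self.outbound.iter().rposition(|eid| !self.pinned.contains(eid)) {
                Some(idx) => closed.push(self.outbound.remove(idx)),
                None => break, // everything left is pinned; stop pruning
            }
        }
        closed
    }
}

fn main() {
    let mut net = PeerNetwork {
        outbound: vec![1, 2, 3, 4, 5],
        pinned: HashSet::new(),
        soft_max_outbound: 3,
    };
    // A StackerDB sync pins the connections it is actively using.
    net.pin_connection(4);
    net.pin_connection(5);
    // Pruning now closes only unpinned connections (3, then 2), not 4 or 5.
    assert_eq!(net.prune_outbound(), vec![3, 2]);
    // Once the sync finishes, the pins are released.
    net.unpin_connection(4);
    net.unpin_connection(5);
}
```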
You can see the effect of this PR in `/v2/neighbors` -- inbound and outbound peer entries now report an age (in seconds), which should rarely be reset due to the pinning. Before, neighbors would come and go very quickly as state machines connected to them and the pruner immediately disconnected them.

Leaving as a draft for now so I can test this live with the Nakamoto testnet signers.
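To make the age reporting concrete, here is a minimal sketch of how a per-neighbor age in seconds could be derived; the struct and field names (`NeighborEntry`, `connected_at`) are assumptions for illustration, not the node's actual RPC schema:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical neighbor entry as it might appear in /v2/neighbors output.
struct NeighborEntry {
    public_key_hash: String,
    /// UNIX timestamp at which the connection was established.
    connected_at: u64,
}

impl NeighborEntry {
    /// Age in seconds since the connection was opened. With pinning, this
    /// keeps growing for peers in active inv/StackerDB sync instead of
    /// resetting every time the pruner recycled the connection.
    fn age(&self) -> u64 {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before UNIX epoch")
            .as_secs();
        now.saturating_sub(self.connected_at)
    }
}

fn main() {
    let peer = NeighborEntry {
        public_key_hash: "0123abcd".into(),
        connected_at: 1_700_000_000,
    };
    println!("peer {} age: {}s", peer.public_key_hash, peer.age());
}
```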