[Network] StackerDB decoherence

jcnelson commented 2 months ago

For reasons that are not yet clear, Nakamoto testnet and mainnet StackerDB replicas will eventually lose coherence. Writes to one replica do not find their way to others -- neither via push, nor via sync. This needs investigation, and may be partially fixed by #5191.

diwakergupta commented 2 months ago

I've observed something on a signer node that might be related. Noting here for tracking, happy to move to a new issue if that's more appropriate. Note that I've already sought input and debugging help from @hstove and @jferrant on this.

The setup:

stacks-node v2.5.0.0.6, running as a follower, with stacker = true
stacks-signer v2.5.0.0.5.2
neither service is exposed publicly, but both have full outbound connectivity

I'm running the binaries directly, co-located on the same machine. There's also a dedicated bitcoind. This setup has been running for several months at this point, without any problems.

Symptoms:

Jacinta's tool for reporting missing signers included my signer's address when she ran it against 2 separate mainnet nodes
The same tool, when run against my node, correctly reports my signer's address. In fact, the delta was only my signer's address (when compared to one of the nodes above)
There are no warnings or errors in either my node or signer logs. Signer correctly logs "Mock signing for burn block ..." on new burn blocks
My node's /v2/neighbors reports plenty of nodes with non-empty stackerdb entries. I can include the full output if that helps.
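
For anyone who wants to repeat the last check, here is a minimal sketch that counts how many neighbors advertise StackerDB replicas. It rests on assumptions not confirmed in this thread: that the node's RPC listens on localhost:20443, that the neighbors response has "inbound" and "outbound" lists, and that each entry may carry a "stackerdbs" list. The helper names are hypothetical, so adjust them and the field names to match the actual output.

```python
import json
from urllib.request import urlopen


def count_stackerdb_neighbors(neighbors: dict) -> dict:
    """Return {direction: (peers_with_stackerdbs, total_peers)}.

    Assumes (unverified) that the response has "inbound"/"outbound" lists
    and that each peer entry may carry a "stackerdbs" list.
    """
    summary = {}
    for direction in ("inbound", "outbound"):
        peers = neighbors.get(direction) or []
        # A missing, null, or empty "stackerdbs" field counts as "no replicas".
        with_dbs = [p for p in peers if p.get("stackerdbs")]
        summary[direction] = (len(with_dbs), len(peers))
    return summary


def fetch_neighbors(base_url: str = "http://localhost:20443") -> dict:
    """Fetch the neighbors document from a node; base_url is an assumption."""
    with urlopen(f"{base_url}/v2/neighbors", timeout=10) as resp:
        return json.load(resp)
```

Calling `count_stackerdb_neighbors(fetch_neighbors())` on the node's host gives one pair per direction. Comparing the pairs across two nodes is a rough way to spot a node whose peers have stopped advertising replicas.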

jcnelson commented 2 months ago

I think I know the reason for this now. The network pruner starts removing new connections after 10 outbound peers have been found (this is the default limit). Network subsystems have a way of "pinning" connections so they won't get pruned while they're in use, but there was a bug in the way the pinning system worked which had a very immediate and noticeable impact on StackerDB (especially since a signer or miner would be running a couple dozen replicas). I'll have a patch out soon, once I'm done testing it.
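
To illustrate the failure mode, here is a toy model of pruning with pinning. It is not the stacks-core implementation: the `Conn` type, the `prune` function, and the newest-first eviction order are all assumptions made for the sketch. Only the default limit of 10 outbound peers comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    peer: str
    age: int  # larger means the connection has been open longer


def prune(conns, pinned, max_outbound=10):
    """Drop the newest unpinned connections until max_outbound remain.

    Toy model only: pinned peers are never eviction candidates.
    """
    excess = len(conns) - max_outbound
    if excess <= 0:
        return list(conns)
    # Eviction candidates are unpinned connections, newest first.
    candidates = sorted(
        (c for c in conns if c.peer not in pinned), key=lambda c: c.age
    )
    drop = {c.peer for c in candidates[:excess]}
    return [c for c in conns if c.peer not in drop]


# Twelve outbound connections; peer11 is the newest and is mid-sync
# with a StackerDB replica.
conns = [Conn(f"peer{i}", age=12 - i) for i in range(12)]

# Working pin: peer11 survives, and older unpinned peers are dropped instead.
kept_with_pin = {c.peer for c in prune(conns, pinned={"peer11"})}

# Broken pin (modeled here as an empty pin set): peer11 is pruned mid-sync.
kept_without_pin = {c.peer for c in prune(conns, pinned=set())}
```

In this model, a pin that fails to register behaves like the empty pin set: the connection carrying the replica sync is the first to go once the limit is exceeded, so writes never propagate.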

diwakergupta commented 2 months ago

Based on the draft PR, would a workaround be to increase soft_max_neighbors_per_org? Happy to test that out if it helps.

wileyj commented 1 month ago

Closing since #5197 is merged.


stacks-network / stacks-core

[Network] StackerDB decoherence #5193