Closed conorsch closed 11 months ago
I[2023-09-19|13:52:09.591] Saving AddrBook to file module=p2p book=/root/.penumbra/testnet_data/node0/tendermint/config/addrbook.json size=5 I[2023-09-19|13:52:09.594] Ensure peers module=pex numOutPeers=0 numInPeers=0 numDialing=0 numToDial=10 I[2023-09-19|13:52:39.594] Ensure peers module=pex numOutPeers=0 numInPeers=0 numDialing=0 numToDial=10 I[2023-09-19|13:53:09.593] Ensure peers module=pex numOutPeers=0 numInPeers=0 numDialing=0 numToDial=10
Look abnormal
Talked this over with @zbuc. Changes to make:
join
logic to take a random subset of the found nodes - https://github.com/penumbra-zone/penumbra/pull/3063Add more nodes to our default CI deployment
We're still only creating 2 validators (at genesis) and 2 full nodes, but the cluster is larger now so that we can add more nodes in the future with minimal fuss. Will monitor network graphs post-deploy.
We saw peers become exhausted pretty quickly, within ~1d of the 61 testnet being released:
I'm surprised that all the nodes we monitor have their peer counts incrementing in lockstep; the changes in #3063 were designed to prevent just that from happening. As a temporary workaround, I tried enabling seed_mode=true
on fn-0, which change is visible in the chart above with the plummeting green line (because seed nodes don't maintain peer connections, they merely share address info).
Will raise the limits on peering for our nodes even higher, but that's just kicking the can down the road. Looks like there's work upstream on a CometBFT bootstrap_peers
setting, which seems to fit our use case. Notably upstream's opinion on workarounds is to munge addrbook.json
:
We believe, however, that while making this config parameter available in v0.37 is helpful, its absence does not prevent any actual user case. Notice that it always possible to "manually" add peer addresses to the address book of a node before starting it.
Perhaps that's what pd testnet join
should be doing, rather than setting seeds
. Should we also have dedicated seed nodes? Should we raise the default max peer limits even higher?
Will raise the limits on peering for our nodes even higher
That's helping some, will keep an eye on the new limits.
Our config preserves the default setting allow_duplicate_ip=false
. My understanding of this option is that it raises the cost for someone to run many fullnode identities on a single host, by requiring each identity to have a unique IP address. A side-effect of this option, though, is that if someone runs pd testnet unsafe-reset-all
and rejoins the network from the same IP, peers will reject them as a duplicate IP. That's not the behavior we want out of a testnet.
Given our use of k8s deployments, it's also possible that a given fullnode can have its egress IP change if the pod is rescheduled on a different k8s node (i.e. not penumbra node) during the life of the testnet. That would also prevent others from peering well with it, if I understand the option correctly.
Advice within the tendermint/cometbft community boils down to 1) run seed nodes; and 2) write directly to address book, rather than use the "seeds" setting in the config.toml.
As a temporary workaround, I've manually created a seed node, and also declared a two-way persistent_peers
relationship between it and a fullnode. That should ensure that the seed node always has at least 1 well-connected node, and therefore can gossip reliably. The seed node info for testnet 61 is:
seeds = "tcp://2f49f0e386275635c4996cda69dc35081847c54b@seed-0.testnet.penumbra.zone:26656"
If folks are having trouble peering, running pd testnet join
, then editing the resulting CometBFT config to include the above should help. I've shared this info in Discord, and will make sure to capture any community feedback in this ticket.
The changes here have been a big improvement, but there's still some follow-up work tracked in #3161 that's worth implementing soon. It's worth noting that the seed changes in #3155 not only make first-run more reliable, but they also work automatically with seed configs generated by tenderseed, due to happenstance: https://github.com/binaryholdings/tenderseed/blob/10a64d571ff0868ea5c3f7381af2a0f6524e372c/internal/cmd/start.go#L116
Describe the bug A new node joining a testnet via
pd testnet join
will either get very few peers, or a whole bunch. We should stabilize this behavior.To Reproduce
Expected behavior I expect to receive a large number of peers, stably, on repeated attempts to join the network.
Screenshots If applicable, add screenshots to help explain your problem.
Additional context The reason this is happening is because:
testnet join
logic inspects peer info from the bootstrap node address and adds those addresses to theseeds
line of the Tendermint config;We've also been using
seed_mode=true
on fn-0, which will automatically disconnect peering attempts after exchanging info about other peers. If we instead run more nodes, we should be able to trust Tendermint/CometBFT's address book functionality to track valid peers over the life of the network, rather than hardcoding a few known peers as "seeds" at join time.