provenance-io / provenance

A distributed, proof of stake blockchain designed for the financial services industry.
https://provenance.io
Apache License 2.0

Centralize some of the consensus timing config fields. #2121

Closed: SpicyLemon closed this issue 3 weeks ago

SpicyLemon commented 3 months ago

Summary

There are several consensus timing fields in the config file (config.toml) that we'd like to standardize across nodes.

Problem Definition

We see a lot of sporadic block times, possibly related to the various consensus settings that each node can define for itself.

Proposal

There are a couple of things we can do, depending on how much effort we want to put in.

  1. (low effort) Hard-code the values in our config management stuff. This will involve providing the hard-coded values to viper.
  2. (more effort) Create some new params to hold them and store them somewhere. Set them in the next upgrade, and provide endpoints for getting and setting-via-gov-prop.


SpicyLemon commented 1 month ago

Related to #1824

iramiller commented 1 month ago

Based on input from @scirner22, we should set timeout_commit to 3500ms for pio-mainnet-1, as this value looks like it would conservatively pull the network into alignment around our desired 4s block time. There are still outliers causing additional consensus rounds, which is what is really pulling our block times up currently; the hope is that with further standardization we will see more uniform performance. We are likely to want to further adjust this value in future software upgrades based on network performance.

SpicyLemon commented 1 month ago

I feel like there's more to this than just the timeout_commit field.

All of this is from v1.19.1.

Disclaimer: I thought about removing the values from the stuff in this comment, because I don't want them to be viewed as suggestions or taken to be at all correct or accurate. I only left them here to show what type of field each is. I don't know where I got them and I don't actually run any nodes using these config values.

Here's the consensus section from the config file:

#######################################################
###         Consensus Configuration Options         ###
#######################################################
[consensus]

wal_file = "data/cs.wal/wal"

# How long we wait for a proposal block before prevoting nil
timeout_propose = "3s"
# How much timeout_propose increases with each round
timeout_propose_delta = "500ms"
# How long we wait after receiving +2/3 prevotes for “anything” (ie. not a single block or nil)
timeout_prevote = "1s"
# How much the timeout_prevote increases with each round
timeout_prevote_delta = "500ms"
# How long we wait after receiving +2/3 precommits for “anything” (ie. not a single block or nil)
timeout_precommit = "1s"
# How much the timeout_precommit increases with each round
timeout_precommit_delta = "500ms"
# How long we wait after committing a block, before starting on the new
# height (this gives us a chance to receive some more precommits, even
# though we already have +2/3).
timeout_commit = "4.3s"

# How many blocks to look back to check existence of the node's consensus votes before joining consensus
# When non-zero, the node will panic upon restart
# if the same consensus key was used to sign {double_sign_check_height} last blocks.
# So, validators should stop the state machine, wait for some blocks, and then restart the state machine to avoid panic.
double_sign_check_height = 0

# Make progress as soon as we have all the precommits (as if TimeoutCommit = 0)
skip_timeout_commit = false

# EmptyBlocks mode and possible interval between empty blocks
create_empty_blocks = true
create_empty_blocks_interval = "0s"

# Reactor sleep duration parameters
peer_gossip_sleep_duration = "100ms"
peer_query_maj23_sleep_duration = "2s"

Here they are again in a more concise list (output of provenanced config get consensus).

consensus.create_empty_blocks=true
consensus.create_empty_blocks_interval="0s"
consensus.double_sign_check_height=0
consensus.peer_gossip_sleep_duration="100ms"
consensus.peer_query_maj23_sleep_duration="2s"
consensus.skip_timeout_commit=false
consensus.timeout_commit="4.3s"
consensus.timeout_precommit="1s"
consensus.timeout_precommit_delta="500ms"
consensus.timeout_prevote="1s"
consensus.timeout_prevote_delta="500ms"
consensus.timeout_propose="3s"
consensus.timeout_propose_delta="500ms"
consensus.wal_file="data/cs.wal/wal"
SpicyLemon commented 1 month ago

I found a nice graphic of the block lifecycle. It doesn't mention the timeout config stuff, but it shouldn't be too hard to see where each applies.

(image: tendermint_in_a_nutshell)

Source: https://docs.ipc.space/specifications/cometbft -> https://docs.tendermint.com/v0.34/assets/img/tm-transaction-flow.258ca020.png

iramiller commented 1 month ago

sure ... we can force-set all the consensus configuration parameters if we want to ... I am not sure why nodes were ever allowed to modify these settings on their own to begin with vs setting them in the genesis/protocol config.

For reference these are the defaults for Osmosis (since I had them at hand)

timeout_propose = "3s"
timeout_propose_delta = "500ms"
timeout_prevote = "1s"
timeout_prevote_delta = "500ms"
timeout_precommit = "1s"
timeout_precommit_delta = "500ms"
timeout_commit = "5s"
double_sign_check_height = 0
skip_timeout_commit = false
create_empty_blocks = true
create_empty_blocks_interval = "0s"
peer_gossip_sleep_duration = "100ms"
peer_query_maj23_sleep_duration = "2s"

Of note, the only one here that differs from what you have above is timeout_commit, which we are proposing to standardize. This is a value we have messed with over time, so it is the most likely to vary across our network.

SpicyLemon commented 1 month ago

Looking at that chart, I wonder if we should look at increasing timeout_propose.

If we had stats on what's actually causing the extra rounds, we could better react here.

In that graphic, I'm not entirely sure what timeout_commit governs, though. The timeout_commit comment states:

> How long we wait after committing a block, before starting on the new height (this gives us a chance to receive some more precommits, even though we already have +2/3).

That makes it sound like timeout_commit is the minimum amount of time from the commitment of one block to the commitment of the next block. I.e. the entire loop, not counting the amount of time it takes to actually commit all the data, and possibly covering multiple trips around the loop when more than one round is needed for a block. But "before starting on the new height" could mean many things; it specifically states it can help gather extra pre-commits, so it more than likely covers that part of the loop. And it's also possible that "after committing a block" isn't 100% technically accurate. I.e. maybe it actually starts at the start of a round or at some other point in the loop that isn't mentioned on that graphic.

Here's my take on how the round lifecycle happens:

  1. A new round starts, and a proposer is selected.
  2. Start a timer for the timeout_propose and start waiting for pre-votes and the proposed next block.
  3. If the proposed next block comes in, validate it, broadcast a pre-vote for it (assuming it's valid), and stop the propose timer.
  4. If the propose timer expires, send a nil pre-vote.
  5. Wait for pre-votes (nil or not) from +2/3 of the nodes.
  6. Start a timer for timeout_prevote.
  7. If the node gets pre-votes from all the nodes, but there isn't +2/3 consensus, send a nil pre-commit and cancel the pre-vote timer.
  8. If the pre-vote timer expires, send a nil pre-commit.
  9. If +2/3 of the nodes have pre-voted for the same block, send a pre-commit for it and cancel the pre-vote timer.
  10. Wait for pre-commits (nil or not) from +2/3 of the nodes.
  11. Start a timer for timeout_precommit.
  12. If we get pre-commits from all the nodes, but there isn't +2/3 consensus, cancel the pre-commit timer and start a new round.
  13. If the pre-commit timer expires, start a new round.
  14. If +2/3 of the nodes have pre-committed for the same block, cancel the pre-commit timer.
  15. The block has been accepted.
  16. Wait for the commit timer to expire.
  17. Commit the new block.
  18. Start a timer for the timeout_commit. This will carry over into the next round(s).

That's my interpretation of the process based on that graphic and the config file comments and could easily be all sorts of wrong. For example, maybe step 13 also involves waiting for the commit timer to expire. Or maybe the commit timer starts with step 1. There's lots of ways I can be wrong.

iramiller commented 1 month ago

It seems likely that the proposers are not getting votes to meet the majority (i.e. it is being rejected) and the previous proposer runs again. We are seeing round 2 in the failure cases. Dashboard showing rounds here

SpicyLemon commented 1 month ago

Here's another lifecycle diagram that might be useful, but again, I'm not sure how timeout_commit factors into it.

(image: tendermint_consensus_logic)

Source: https://docs.tendermint.com/v0.34/introduction/what-is-tendermint.html#consensus-overview

SpicyLemon commented 1 month ago

> It seems likely that the proposers are not getting votes to meet the majority (i.e. it is being rejected) and the previous proposer runs again. We are seeing round 2 in the failure cases. Dashboard showing rounds here

The docs make it sound like it's supposed to pick a different proposer for each round, but maybe that's not always the case.

From what I can tell, the timeout_commit is a minimum amount of time between blocks, but the other timeout config fields are maximum waiting times. I.e. the timer from timeout_commit is always run down to zero, but the timers for the others might be cancelled early if certain conditions are met.

I'm sure that standardizing timeout_commit on all the nodes will help. As you pointed out, we've changed our recommendation on it many times (and the default is 1s). If a node has a low value here, it might commit the proposed block and start a new round well before other nodes. So it could be waiting for the next block proposal for a second or two before the proposer even starts putting together the next proposal. That'd make it more likely to miss the proposal in time, broadcast a nil pre-vote, and cause that round to fail.

While writing this, I was considering a timeout_commit of 1s (or less), but that scares me a bit. The last time I fired up a node and had it catch up, it seemed like I was hitting a rate limiter somewhere. I don't know if it's something baked into the SDK's catch-up stuff, or something with how the servers running the nodes are configured. Basically, it'd get a chunk of blocks, then wait a bit, and get a chunk more. But it wasn't averaging much more than a block a second. We need to make sure that we can always replay blocks to catch up faster than we're creating new blocks.

SpicyLemon commented 1 month ago

I'll start putting together the PR to basically hard-code the timeout_commit.

I was thinking it'd be really cool if we could store this stuff in state (updatable through gov prop), but in order to get to the point where we'd be able to read them from state, we would have already created a bunch of the stuff that uses the config values, and I highly doubt there's a way to inform all that stuff of changes to their config.

iramiller commented 1 month ago

Pretty sure I saw something coming in SDK52 or similar that is moving this stuff into consensus params on chain… so a temporary forced config might make sense here.