upgrades: pd exits too early, before submitting pre-upgrade block

conorsch commented 6 months ago

Describe the bug As of #4339, the pd migrate operation inspects CometBFT's priv_validator_state.json file, and errors out of the height is higher than the halt height. That's a good sanity check, but it's possible that the constraint is violated.

During a recent migration test against an online devnet, I observed val-0 migrating successfully, but val-1 failed during migration, erroring with:

2024-05-21T19:10:42.069620Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: parsed genesis file genesis_json=Object {"genesis_time": String("2024-05-21T18:52:33.849663975Z"), "chain_id": String("penumbra-testnet-deimos-8-77aa47d1"), "initial_height": String("300"), "consensus_params":
 Object {"block": Object {"max_bytes": String("22020096"), "max_gas": String("-1"), "time_iota_ms": String("500")}, "evidence": Object {"max_age_num_blocks": String("100000"), "max_age_duration": String("86400000000000"), "max_bytes": String("1048576")}, "validator": Object {"pub_key_types": Array [String("ed25519")]}, "abci": Object {}}, "validators": Array [], "app_hash": Strin
g("47FF8AAAEE45784C6B6637DC46F1157010DAF9036D8047BA5508192A04FE5E55"), "app_state": Object {"genesisCheckpoint": String("R/+Kqu5FeExrZjfcRvEVcBDa+QNtgEe6VQgZKgT+XlU=")}}
2024-05-21T19:10:42.069672Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 300 is already greater than or equal to initial_height 300
command terminated with exit code 1

The contents of the priv_validator_state.json for val-1 was:

{
  "height": "300",
  "round": 0,
  "step": 2,
  "signature": "9dfsrfrUDqtdK+7MUbDgf85idj4orCzrfSZiC/wABWCdhU6cue/h741jptRd+NRJsrmp+R24soIsOEOKBEFBAg==",
  "signbytes": "3D0801112C010000000000002A0C0880DBB3B206108BDD8FA102322270656E756D6272612D746573746E65742D6465696D6F732D382D3737616134376431"
}

So pd did properly halt, but CometBFT still proceeded with preparing 300 block. Across the half of dozen end-to-end migration tests I've run in the past few days, I've encountered this situation only once, but it feels like a race condition: once pd halts, its exiting will close the abci port, and cause cometbft to crash, but it's possible cometbft has advanced slightly.

Fortunately, since pd was stopped, this shouldn't have any bearing on the app state. I believe the proper workaround, if this is encountered in the wild, is to manually decrement the height value in the priv val state by 1, then rerun the migration. More research required to understand what CometBFT is doing behind the scenes, and whether the priv val state is best understood as a WAL or as a watermark (or both).

conorsch commented 6 months ago

Talked this through with @erwanor earlier today, and he's already digging into it. The good news is, we don't believe this is a showstopper for the upcoming release (#4402).

conorsch commented 6 months ago

Caught it again:

2024-05-22T00:50:32.293269Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.jso
n"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 275 is already greater than or equal to initial_height 275
command terminated with exit code 1

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "275",
  "round": 0,
  "step": 2,
  "signature": "FJEQK3cNzlgsyl6/t7RAl3Mxn5maj+6uPaqungu/iQLC5RFw+LmrcPs57ZiDIjmDt49njCMqrdwds4vE8CMkCg==",
  "signbytes": "3D08011113010000000000002A0C08C6FCB4B20610F78DE6E001322270656E756D6272612D746573746E65742D6465696D6F732D382D3865653166333130"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "274",
  "round": 0,
  "step": 3,
  "signature": "SCmvwdmG/QK/r0tgtXGqSJgKTP+RHUAhKbwB2F9K5M56r5lFB5S52yVPijBnxVShtoZEXmZUWwt96IBb17S3AQ==",
  "signbytes": "8701080211120100000000000022480A203D9B6DDAEC9D347F0E3CF34DD7EF33A0CFB73A28B22BFD238C1F11DDF43CEC5C12240801122024A3D9F8A1092256B56FFA6DAC726AB1E56A21FF305FE01C3BCD6C4ED9780A022A0C08BDFCB4B2061090F2AF9E01322270656E756D6272612D746573746E65742D6465696D6F732D382D3865653166333130"
}

The two validators clearly have a different height in their cometbft configs, but both had stopped generating blocks at halt height, as expected. To work around the problem, I tried manually decrementing the value via perl -npi -E 's/.*height.*/ "height": "274",/' priv_validator_state.json and rerunning the migration. It again failed, because pd and cometbft disagreed about the height:

2024-05-22T00:56:43.980023Z  INFO pd_migrate:migrate{self=Testnet76 comet_home=Some("/penumbra-config/penumbra-devnet-val/node0/cometbft")}:migrate_comet_data: pd::migrate: writing genesis file to comet config genesis_file="/penumbra-config/penumbra-devnet-val/node0/cometbft/config/genesis.json"
Error: failed to upgrade state

Caused by:
    priv_validator_state height 274 is already greater than or equal to initial_height 1
command terminated with exit code 1

Next, I decremented the height to "0", and the migration completed successfully after that, but the network would not resume. Error messages include:

E[2024-05-22|01:11:21.021] Peer sent us invalid msg                     module=consensus peer="Peer{MConn{34.16.34.194:26656} 6bc340b9a9b2fbb9d1fac53c972973c22b252c75 out}" msg="[NewRoundStep H:275 R:0 S:RoundStepPrevote LCR:-1]" err="LastCommitRound can only be negative for initial height 1"
E[2024-05-22|01:11:21.021] Stopping peer for error                      module=p2p peer="Peer{MConn{34.16.34.194:26656} 6bc340b9a9b2fbb9d1fac53c972973c22b252c75 out}" err="LastCommitRound can only be negative for initial height 1"

Proceeding with more testing, but this problem is already persistent enough that we need a solution before shipping.

erwanor commented 6 months ago

priv_validator_state height 274 is already greater than or equal to initial_height 1

I think this is a different error though, running a migration over chain state that has already been migrated. In the future this should be caught by the halt bit flag.

conorsch commented 6 months ago

Fair, but even decrementing to 0 also didn't work. Should I perhaps have decremented to 1? My impulse is to loosen the constraint to LTE for the pd migrate height check, but I don't fully understand the implications of doing so. The devnet is dead in the water:

❯ cargo run -q --release --bin pcli -- --home ~/.local/share/pcli-devnet/ q chain info
Chain Info:

 Current Block Height   0
 Current Epoch          0
 Total Validators       2
 Active Validators      2
 Inactive Validators    0
 Jailed Validators      0
 Tombstoned Validators  0
 Disabled Validators    0

so we need a clear plan on how to handle this error.

hdevalence commented 6 months ago

No, none of this manually messing around with the height should be required. The behavior of the current code is correct.

The question that should be answered and isn’t is why Comet ends up past the halt height in the first place.

hdevalence commented 6 months ago

I believe the proper workaround, if this is encountered in the wild, is to manually decrement the height value in the priv val state by 1, then rerun the migration.

This is never the correct solution, as it asks the validator to double-sign a block.

hdevalence commented 6 months ago

Across the half of dozen end-to-end migration tests I've run in the past few days, I've encountered this situation only once, but it feels like a race condition: once pd halts, its exiting will close the abci port, and cause cometbft to crash, but it's possible cometbft has advanced slightly.

Comet shouldn’t be signing the next block immediately after getting a commit message from the application, unless the block time has been turned all the way down to zero, which means we are testing a situation we don’t care about. Can you confirm that this behavior occurs when using the intended 5-second block time?

erwanor commented 6 months ago

The migrate command is not idempotent (yet). So my guess of what happened here is that the exported chain state worked from a migrated chain state (with application height zero). It generated a genesis with initialHeight: 1. This isn't a problem with the priv validator state, even if the error messages look somewhat similar. It's a problem of corrupted chain state which is unfortunate, but should be solved altogether for the next upgrade.

@hdevalence We're looking into it. The current working theory is that this is a race condition.

There's a minuscule window of time between the moment we return Commit and when we interrupt the ABCI connection. This leaves a minuscule window of time for comet to produce and sign a new proposal and tick its signing state (the fact that the mempool is empty helps). AFAICT, it does so even before doing an app round trip to check that the proposal is correct.

erwanor commented 6 months ago

To answer your latest comment, AFAICT comet will actually update its signing state as soon as it creates a new proposal, even before checking that it is valid at all.

conorsch commented 6 months ago

Can you confirm that this behavior occurs when using the intended 5-second block time?

Yes, still using the default default 5s timeout-commit value. However, the devnet has customized values for proposal_voting_blocks=50 and epoch_duration=20.

AFAICT comet will actually update its signing state as soon as it creates a new proposal

This feels right to me: pd hasn't moved past halt-1, but cometbft updates its priv val state the halt height.

hdevalence commented 6 months ago

Why is comet creating a new proposal immediately?

What value of timeout_commit is in use? Should that not prevent Comet from immediately continuing?

conorsch commented 6 months ago

Encountered again. Target height in the upgrade plan was 540. Validators showed:

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "539",
  "round": 0,
  "step": 3,
  "signature": "6esGuH2dM9ZWQJKfAi2yTr0G9OrTGN/mwWwoo91eX5nVW/L1rqiQH5SEZK6yXKfBtAKRymyXsKX7KXfxkoEBDA==",
  "signbytes": "87010802111B0200000000000022480A202E5BB4D8B3D30CAF6D5019A8A84C5D3D4E68BAC1B3C34451B1510C957876322F122408011220C2FC1A22BD906AD4053369038BB83DD190E270404A9D4290B914F2EE9CA4
82992A0C08F6B1B5B20610A3EEC0D601322270656E756D6272612D746573746E65742D6465696D6F732D382D6563363732313937"
}

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "540",
  "round": 0,
  "step": 2,
  "signature": "bLvn3a/agsR9heeTaBk7kZuDCO0ZR5HMuadxj5MMAovOTiWuulDHyqjlFoZr41o/G0q53HMzEZfFDzk+CluBAw==",
  "signbytes": "3D0801111C020000000000002A0C0882B2B5B206109387FE9403322270656E756D6272612D746573746E65742D6465696D6F732D382D6563363732313937"
}

This time, before running the migration, I decremented height=540 to height=539, then ran the migration. The migration succeeded, and the chain resumed generating blocks without any errors. I understand that we want to avoid manual munging of priv val state, but providing this info to help us diagnose the root cause.

erwanor commented 6 months ago

Great, I'm going to log into the kube instance to inspect the comet WAL.

erwanor commented 6 months ago

Re @hdevalence: Yeah agreed, on second look the race condition story doesn't checkout since the commit timeout should slow down consensus. Moreover, re-reading the initial error, we can notice that we're already collecting prevotes (step=2):

{
  "height": "300",
  "round": 0,
  "step": 2,
sigdata
}

erwanor commented 6 months ago

@conorsch does this specific setup has a different restart cadence than the others? It shouldn't matter, but it might help pinpoint exactly the cause. ~~I notice that it's always val-1~~

erwanor commented 6 months ago

@conorsch can you suspend pd/comet without winding down the containers or have them restart? I want to acquire a lock on the comet db but it's runnning.

My hunch is that the "pd crash" logic that we have is bad, but need logs for conclusive evidence

erwanor commented 6 months ago

My hunch is that the "pd crash" logic that we have is bad, but need logs for conclusive evidence

I am certain that this is the problem. The full node should stop, period, but instead it keeps serving ABCI Info requests. This interacts with the restart logic, and after some time, the proposing node restarts, is able to query the full node, and observes that the timeout period has elapsed.

conorsch commented 6 months ago

@conorsch can you suspend pd/comet without winding down the containers or have them restart? I want to acquire a lock on the comet db but it's runnning.

The way we perform the migrations is to place the pods in "maintenanceMode", replacing the usual pd and cometbft commands with sleep infinity. That way, the same environment is available, but no services are running or touching the state. It's essentially equivalent to systemctl stop penumbra cometbft in a VM environment. I've done that on the still-running devnet, so you can poke around.

More relevant for our needs is probably the pre-upgrade snapshot archives, which are available for fetching here:

And less interesting but still available are the post-upgrade migration archives, taken immediately after the migration was run, but before the network was resumed:

My hunch is that the "pd crash" logic that we have is bad [...] The full node should stop, period, but instead it keeps serving ABCI Info requests.

It's worth noting the crash behavior we're investigating is in the v0.75.0 code. We've used the halt-count mechanism, now deprecated, to halt the chain successfully before. But I agree with you, we're not seeing the behavior I'd expect.

erwanor commented 6 months ago

The problem is that we don't actually halt the full node. We leave it in an intermediary state where it's able to serve certain requests, but not others. This is because we crash two of its four core services (Consensus and Mempool). We should stop all four, or they should never start, that's the correct behavior.

The node eventually crashes once comet start making consensus requests because it believes the application is "ready". This is really not ideal. Crucially, it's unrelated to whether we use a halt bit or a halt counter, or any other mechanism. The same issue will creep up regardless of what the "halting rule" is.

I will push a PR, but for today's update we will have to do without it, so we need to make sure that the validators do not have auto-restart enabled to avoid having to touch the signing state.

conorsch commented 6 months ago

First attempt at resolution was #4436, which successfully caused pd to exit completely, but a bit too early: on several test runs, pd exited before cometbft had broadcast the halt-minus-1 block. We observed fullnodes stalled at halt-minus-2, which poses problems for migration. This issues remains a problem, and should be resolved prior to proceeding with the next chain upgrade. Ideally, the fix for this issue will land in a point release 0.75.1, after which we can proceed with the scheduled 0.76.0 upgrade-via-migration.

penumbra-zone / penumbra

upgrades: pd exits too early, before submitting pre-upgrade block #4432