pd crash loop if tendermint exits between init_chain and first block

conorsch commented 1 year ago

Describe the bug The pd application will get stuck in a failed state if tendermint sends an init_chain event to pd, but then exits before the first block is handled and submitted to the app. On subsequent service start, tendermint will again send an init_chain event, because it hasn't sent a block before, causing pd to reject the second init_chain msg, crashing.

To Reproduce Steps to reproduce the behavior:

pd testnet unsafe-reset-all
pd testnet join
start node by your favorite method
kill tendermint process (e.g. via ctrl+c if run interactively) quickly (~<2s) after starting
restart tendermint process, observe error msg in pd logs:

thread 'tokio-runtime-worker' panicked at 'init_chain must succeed: database already initialized', /home/penumbra/penumbra/pd/src/consensus/worker.rs:45:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: consensus worker terminated or panicked', /home/penumbra/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-abci-0.2.0/src/server.rs:164:70

Expected behavior pd should resume sipping events from tendermint and not crash.

Additional context We believe that the problem of tendermint not sending blocks correlates with inability to find peers. The longer peers cannot be found, the more likely this crash loop is to occur. The current best workaround is to pd testnet unsafe-reset-all and try joining again.

To solve the problem, we can avoid calling commit in the app code until the first block has been processed. Quoth @hdevalence:

one possibility would be that we should not commit state changes after InitChain, and wait until we get the first BeginBlock/DeliverTx/EndBlock/Commit stanza for it to be committed

Currently we use a u64 as version, where the max value indicates an uninitialized state:

https://github.com/penumbra-zone/penumbra/blob/3fdb59a21eb2a8890c14d523308c79409ec0aeb8/storage/src/storage.rs#L118-L124

If we change that logic, we should be mindful of off-by-one errors in other parts of the code that reference of the version.

conorsch commented 1 year ago

To solve the problem, we can avoid calling commit in the app code until the first block has been processed.

Further discussion with Tendermint experts confirms this is the right approach. If Tendermint is halted early, and submits init_chain, the app should respond kindly.

conorsch commented 1 year ago

I plan to pair with @plaidfinch on this one; we'll report progress as we go.

conorsch commented 1 year ago

Some notes from initial discussion. We should hash the InitChain request and save the hash to non-consensus storage. Doing so would allow us to at any point in the future to compare an InitChain request and determine what response to make: if the app has received an InitChain request in the past, but not yet committed anything, and the current request matches the previous, then it's OK to proceed. Otherwise, we fail loudly.

hdevalence commented 1 year ago

Wait, why do we need to do that, instead of just not calling commit() at the end of InitChain?

We know we'll never get a duplicate InitChain message from Tendermint, that would be illegal, and we already have to trust Tendermint's ABCI messages, so I don't think there's a lot of benefit to building a custom system to be able to double check, rather than just solving the bug in our code we're currently seeing.

conorsch commented 1 year ago

In order to handle an InitChain request, we must pass back an InitChain response, which requires an app hash. To generate that app hash, we commit to storage, during which action the JMT root hash is computed. So our choices appear to be a) reimplement the save-to-JMT logic but without actually saving to storage, which feels error-prone; or b) hash the InitChain request for comparison later. During our discussion, we opted for b) as the cleaner approach.

hdevalence commented 1 year ago

Hmm, but the app hash is app-defined, what if we simply returned [0u8; 32] in the InitChain response, and waited for the first Commit to compute a real one?

conorsch commented 1 year ago

Will give that a shot!

hdevalence commented 1 year ago

Alternatively, if the all-zero array is used by Go/Tendermint as a default value, we could use [255u8; 32] (all ones); either works just as well for our purpose, I think.

plaidfinch commented 1 year ago

Shouldn't the app hash bind the consensus chain state? If we do this, then in this special case, the app hash is not binding the contents of the JMT, which seems like it could lead to problems.

hdevalence commented 1 year ago

There's no consensus state yet, though, because commit has never been called.

plaidfinch commented 1 year ago

Oh, I see! Then if there's a consensus divergence, it will occur after the first commit, which happens after the first block is processed. This seems like the best option to me, let's do it.

conorsch commented 1 year ago

I wasn't able to get the fudged-hash approach working. Using the [0u8; 32] approach fails with an app hash mismatch from tendermint:

Feb 07 08:58:14 Antigonus tendermint[260242]: E[2023-02-07|08:58:14.301] Error in validation module=blockchain err="wrong Block.Header.AppHash. Expected 0000000000000000000000000000000000000000000000000000000000000000, got 9D0C36F455A5E8BE6C0EE48E0EE18761C494BB81C531258C28754CE00B26BE28"

Then, with 255s rather than 0s:

Feb 07 09:05:37 Antigonus tendermint[274685]: E[2023-02-07|09:05:37.747] Error in validation module=blockchain err="wrong Block.Header.AppHash. Expected FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, got 9D0C36F455A5E8BE6C0EE48E0EE18761C494BB81C531258C28754CE00B26BE28"

The expected app hash is the same between both, so I'm still inclined to using the sidecar storage as described above. Perhaps I'm missing something; I posted the small changes involved, as well as a testing script to aid in repro, in https://github.com/penumbra-zone/penumbra/pull/1959. Would appreciate an opportunity to discuss further.

conorsch commented 1 year ago

De-prioritized in favor of #1843; leaving research here for a later date. This problem isn't critical.

conorsch commented 1 year ago

This problem reared its head again during Testnet 61, due to trouble peering (#3056). Would still be nice to have a solution to this one.

erwanor commented 3 months ago

We could fix this by computing the app hash without committing the state. This is possible since we added prepare_commit in https://github.com/penumbra-zone/penumbra/issues/4095 (more context there).

penumbra-zone / penumbra

pd crash loop if tendermint exits between init_chain and first block #1884