conorsch opened this issue 1 year ago
To solve the problem, we can avoid calling commit in the app code until the first block has been processed.
Further discussion with Tendermint experts confirms this is the right approach. If Tendermint is halted early and re-submits init_chain on restart, the app should respond gracefully.
I plan to pair with @plaidfinch on this one; we'll report progress as we go.
Some notes from the initial discussion: we should hash the InitChain request and save the hash to non-consensus storage. That would allow us, at any point in the future, to compare an incoming InitChain request against the stored hash and determine what response to make. If the app has received an InitChain request in the past but has not yet committed anything, and the current request matches the previous one, then it's OK to proceed; otherwise, we fail loudly.
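The hash-and-compare idea described above can be sketched as follows. This is a minimal sketch: the store layout and function names are hypothetical, and std's `DefaultHasher` stands in for whatever cryptographic hash the real code would use over the serialized `RequestInitChain`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in digest over the serialized InitChain request bytes; the actual
// implementation would use a cryptographic hash such as SHA-256.
fn request_digest(request_bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    request_bytes.hash(&mut hasher);
    hasher.finish()
}

// `stored` models a slot in non-consensus storage: Some(digest) once an
// InitChain request has been recorded, None before that.
fn handle_init_chain(stored: &mut Option<u64>, request_bytes: &[u8]) -> Result<(), String> {
    let digest = request_digest(request_bytes);
    match stored {
        // First InitChain ever seen: record it and proceed.
        None => {
            *stored = Some(digest);
            Ok(())
        }
        // Replayed InitChain matching the recorded one: safe to proceed.
        Some(previous) if *previous == digest => Ok(()),
        // A *different* InitChain after one was recorded: fail loudly.
        Some(_) => Err("InitChain request differs from the previously recorded one".to_string()),
    }
}
```

The key property is that a Tendermint restart which replays the identical genesis is accepted, while a mismatched genesis is rejected rather than silently re-initializing state.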
Wait, why do we need to do that, instead of just not calling commit() at the end of InitChain?
We know we'll never get a duplicate InitChain message from Tendermint (that would be illegal), and we already have to trust Tendermint's ABCI messages, so I don't see much benefit in building a custom system to double-check, rather than just fixing the bug in our code we're currently seeing.
In order to handle an InitChain request, we must pass back an InitChain response, which requires an app hash. To generate that app hash, we commit to storage, which is when the JMT root hash is computed. So our choices appear to be: (a) reimplement the save-to-JMT logic without actually saving to storage, which feels error-prone; or (b) hash the InitChain request for comparison later. During our discussion, we opted for (b) as the cleaner approach.
Hmm, but the app hash is app-defined, what if we simply returned [0u8; 32] in the InitChain response, and waited for the first Commit to compute a real one?
Will give that a shot!
Alternatively, if the all-zero array is used by Go/Tendermint as a default value, we could use [255u8; 32] (all ones); either works just as well for our purpose, I think.
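For illustration, the placeholder-hash proposal amounts to something like the following. The names are hypothetical, not the actual pd types; the only facts taken from the thread are the two candidate sentinel values.

```rust
// Sentinel app hash returned in the InitChain response before anything has
// been committed; [255u8; 32] would serve equally well if all-zeros
// collides with a Go/Tendermint default value.
const PLACEHOLDER_APP_HASH: [u8; 32] = [0u8; 32];

// Hypothetical helper: report the real JMT root once a commit has
// happened, and the sentinel before that.
fn init_chain_response_app_hash(committed_root: Option<[u8; 32]>) -> [u8; 32] {
    committed_root.unwrap_or(PLACEHOLDER_APP_HASH)
}
```

As the later comments in this thread show, this approach failed in practice: Tendermint validated the placeholder against the real root and rejected the block.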
Shouldn't the app hash bind the consensus chain state? If we do this, then in this special case, the app hash is not binding the contents of the JMT, which seems like it could lead to problems.
There's no consensus state yet, though, because commit has never been called.
Oh, I see! Then if there's a consensus divergence, it will occur after the first commit, which happens after the first block is processed. This seems like the best option to me; let's do it.
I wasn't able to get the fudged-hash approach working. Using the [0u8; 32] approach fails with an app hash mismatch from tendermint:
Feb 07 08:58:14 Antigonus tendermint[260242]: E[2023-02-07|08:58:14.301] Error in validation module=blockchain err="wrong Block.Header.AppHash. Expected 0000000000000000000000000000000000000000000000000000000000000000, got 9D0C36F455A5E8BE6C0EE48E0EE18761C494BB81C531258C28754CE00B26BE28"
Then, with 255s rather than 0s:
Feb 07 09:05:37 Antigonus tendermint[274685]: E[2023-02-07|09:05:37.747] Error in validation module=blockchain err="wrong Block.Header.AppHash. Expected FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, got 9D0C36F455A5E8BE6C0EE48E0EE18761C494BB81C531258C28754CE00B26BE28"
The expected app hash is the same in both cases, so I'm still inclined to use the sidecar storage as described above. Perhaps I'm missing something; I posted the small changes involved, as well as a testing script to aid in reproduction, in https://github.com/penumbra-zone/penumbra/pull/1959. Would appreciate an opportunity to discuss further.
De-prioritized in favor of #1843; leaving research here for a later date. This problem isn't critical.
This problem reared its head again during Testnet 61, due to trouble peering (#3056). Would still be nice to have a solution to this one.
We could fix this by computing the app hash without committing the state. This is possible since we added prepare_commit in https://github.com/penumbra-zone/penumbra/issues/4095 (more context there).
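A hedged sketch of that flow, with struct and method names modeled on the description above rather than the actual penumbra Storage API: the root for pending writes is computed and staged without being persisted, so InitChain can respond with a real app hash while the first commit is deferred until the first block.

```rust
// Illustrative stand-in for the storage layer; the real JMT root
// computation is elided and represented by a precomputed root value.
#[derive(Default)]
struct Storage {
    committed_root: Option<[u8; 32]>,
    staged_root: Option<[u8; 32]>,
}

impl Storage {
    // Stage pending writes and return the root they would commit to,
    // without persisting anything yet.
    fn prepare_commit(&mut self, pending_root: [u8; 32]) -> [u8; 32] {
        self.staged_root = Some(pending_root);
        pending_root
    }

    // Persist the staged state; consensus state exists only after this.
    fn commit(&mut self) {
        self.committed_root = self.staged_root.take();
    }
}
```

The point is that `prepare_commit` lets InitChain return the genuine root hash, avoiding both the placeholder-hash mismatch and the premature commit that caused the original crash loop.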
Describe the bug
The `pd` application will get stuck in a failed state if tendermint sends an `init_chain` event to `pd`, but then exits before the first block is handled and submitted to the app. On subsequent service start, tendermint will again send an `init_chain` event, because it hasn't sent a block before, causing `pd` to reject the second `init_chain` msg, crashing.

To Reproduce
Steps to reproduce the behavior:
1. `pd testnet unsafe-reset-all`
2. `pd testnet join`
3. Stop the process (ctrl+c if run interactively) quickly (~<2s) after starting `pd`
4. Observe the crash in the `pd` logs.

Expected behavior
pd should resume sipping events from tendermint and not crash.
Additional context
We believe that the problem of tendermint not sending blocks correlates with an inability to find peers: the longer peers cannot be found, the more likely this crash loop is to occur. The current best workaround is to run `pd testnet unsafe-reset-all` and try joining again.

To solve the problem, we can avoid calling commit in the app code until the first block has been processed. Quoth @hdevalence:
Currently we use a u64 as version, where the max value indicates an uninitialized state:
https://github.com/penumbra-zone/penumbra/blob/3fdb59a21eb2a8890c14d523308c79409ec0aeb8/storage/src/storage.rs#L118-L124
If we change that logic, we should be mindful of off-by-one errors in other parts of the code that reference the version.
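As an illustration of that off-by-one hazard: only the u64::MAX sentinel comes from the linked code, and the helper names here are hypothetical.

```rust
// u64::MAX marks an uninitialized state (no commit yet), mirroring the
// pattern in the linked storage.rs.
const UNINITIALIZED_VERSION: u64 = u64::MAX;

fn is_initialized(version: u64) -> bool {
    version != UNINITIALIZED_VERSION
}

// The off-by-one hazard: the version after "uninitialized" must be 0,
// and a naive `version + 1` on the sentinel would overflow (panicking
// in debug builds, wrapping in release builds).
fn next_version(version: u64) -> u64 {
    if version == UNINITIALIZED_VERSION {
        0
    } else {
        version + 1
    }
}
```

Any code that assumes versions increase monotonically from 0 needs to special-case the sentinel the same way.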