penumbra-zone / penumbra

Penumbra is a fully private proof-of-stake network and decentralized exchange for the Cosmos ecosystem.
https://penumbra.zone
Apache License 2.0
358 stars 289 forks source link

Chain halt on 048-carme #2156

Closed conorsch closed 1 year ago

conorsch commented 1 year ago

The Carme testnet (#2117) halted at block 7909. Below are log excerpts.

Tendermint logs:

About 45m prior, these stood out as exceptional:

I[2023-03-14|11:29:32.591] failed attempting to add vote                module=consensus err="expected 7450/1/2, but got 7450/0/2: unexpected step"
E[2023-03-14|11:29:32.591] failed to process message                    module=consensus height=7451 round=0 peer=564a8ab4229b013b370fb6715fca9a6566969512 msg_type=*consensus.VoteMessage err="error adding vote"
I[2023-03-14|11:29:32.591] failed attempting to add vote                module=consensus err="expected 7450/1/2, but got 7450/0/2: unexpected step"
E[2023-03-14|11:29:32.591] failed to process message                    module=consensus height=7451 round=0 peer=564a8ab4229b013b370fb6715fca9a6566969512 msg_type=*consensus.VoteMessage err="error adding vote"
I[2023-03-14|11:29:32.591] failed attempting to add vote                module=consensus err="expected 7450/1/2, but got 7450/0/2: unexpected step"
E[2023-03-14|11:29:32.591] failed to process message                    module=consensus height=7451 round=0 peer=564a8ab4229b013b370fb6715fca9a6566969512 msg_type=*consensus.VoteMessage err="error adding vote"
I[2023-03-14|11:29:32.591] failed attempting to add vote                module=consensus err="expected 7450/1/2, but got 7450/0/2: unexpected step"

Immediately prior to halt:

I[2023-03-14|12:16:31.532] committed state                              module=state height=7908 num_txs=0 app_hash=1538470AB75FE1E6D66A4336FF1890AB320C26C145FE38C12E585A57A77E1261
I[2023-03-14|12:16:31.541] indexed block exents                         module=txindex height=7908
I[2023-03-14|12:16:31.884] Inbound Peer rejected                        module=p2p err="auth failure: handshake failed: EOF" numPeers=19
I[2023-03-14|12:16:35.912] Inbound Peer rejected                        module=p2p err="auth failure: handshake failed: EOF" numPeers=19
I[2023-03-14|12:16:36.254] Timed out                                    module=consensus dur=4.716886403s height=7909 round=0 step=RoundStepNewHeight
I[2023-03-14|12:16:36.430] received proposal                            module=consensus proposal="Proposal{7909/0 (1DC92EE81D233C090F4D3F6D3872CED9DEEE1963A36A6241C51F1CDEAF43AA72:1:9282B954462D, -1) 2C18C949F915 @ 2023-03-14T12:16:36.276167345Z}"
I[2023-03-14|12:16:36.431] received complete proposal block             module=consensus height=7909 hash=1DC92EE81D233C090F4D3F6D3872CED9DEEE1963A36A6241C51F1CDEAF43AA72
I[2023-03-14|12:16:36.842] finalizing commit of block                   module=consensus height=7909 hash=1DC92EE81D233C090F4D3F6D3872CED9DEEE1963A36A6241C51F1CDEAF43AA72 root=1538470AB75FE1E6D66A4336FF1890AB320C26C145FE38C12E585A57A77E1261 num_txs=0
I[2023-03-14|12:16:36.904] executed block                               module=state height=7909 num_valid_txs=0 num_invalid_txs=0
I[2023-03-14|12:16:36.952] committed state                              module=state height=7909 num_txs=0 app_hash=4E20E34784DC2FB45E966F328CF13B71744E2E45D5FB8AAC95481D9BE68971A0
I[2023-03-14|12:16:36.960] indexed block exents                         module=txindex height=7909
I[2023-03-14|12:16:37.488] Inbound Peer rejected                        module=p2p err="auth failure: handshake failed: EOF" numPeers=19

Pd logs, from early on in the chain, show some proof verification failures—presumably unrelated to chain halt, but still interesting:

2023-03-14T06:59:31.383270Z  INFO abci:CheckTx{kind=New txid="089dd3064d5c66ac6eb03ef7b437646ecafc2a354409334e3e7dcc199ce15913"}: tx rejected e=a spend proof did not verify

Caused by:
    spend proof did not verify elapsed=13.754718ms
2023-03-14T06:59:31.383901Z  INFO grpc{remote_addr=45.88.223.97:59442}: Response {
    code: Err(
        1,
    ),
    data: b"",
    log: "a spend proof did not verify: spend proof did not verify",
    hash: Hash::Sha256(089DD3064D5C66AC6EB03EF7B437646ECAFC2A354409334E3E7DCC199CE15913),
}
2023-03-14T07:02:42.557781Z  INFO abci:CheckTx{kind=New txid="4d70cadab885a620912da9da748af059b3ee47840a782e78914f8435fe127fcf"}: tx rejected e=output proof did not verify elapsed=10.099148ms
2023-03-14T07:02:42.558547Z  INFO grpc{remote_addr=45.88.223.97:59812}: Response {
    code: Err(
        1,
    ),
    data: b"",
    log: "output proof did not verify",
    hash: Hash::Sha256(4D70CADAB885A620912DA9DA748AF059B3EE47840A782E78914F8435FE127FCF),
}

Then, pd logs just after the halt:

2023-03-14T12:42:40.451577Z  INFO abci:CheckTx{kind=New txid="db73852e2f6e168b71d4055eb87955013fdba6699171732c543411abcda1762a"}: tx rejected e=nullifier f26e39786077488c31e85a1b0e5f9195da2cca926fdea9e12271eb0739e92e02 was already spent in NoteSource::Transaction(f39009aad6a878c26e0154bcce9161e4269896d8f8dac5ee66eda4911a7255df) elapsed=11.941678ms
2023-03-14T12:42:40.452228Z  INFO grpc{remote_addr=38.242.235.184:48092}: Response {
    code: Err(
        1,
    ),
    data: b"",
    log: "nullifier f26e39786077488c31e85a1b0e5f9195da2cca926fdea9e12271eb0739e92e02 was already spent in NoteSource::Transaction(f39009aad6a878c26e0154bcce9161e4269896d8f8dac5ee66eda4911a7255df)",
    hash: Hash::Sha256(DB73852E2F6E168B71D4055EB87955013FDBA6699171732C543411ABCDA1762A),
}
2023-03-14T12:42:42.511116Z  INFO abci:CheckTx{kind=New txid="c550d2fd93f3e77508d261dcf214bf557019b49c8df04389a5074d5b519bb32b"}: tx rejected e=nullifier a4b62586c9d5fa92eca28133e7e9bb42e5d370bd28e775e18f38ee5042c91d0c was already spent in NoteSource::Transaction(1d56ca53120eba5dff132c45aff4e35543af90c0826b6061480233091acaa6a6) elapsed=11.461898ms
2023-03-14T12:42:42.511494Z  INFO grpc{remote_addr=149.102.148.189:46904}: Response {
    code: Err(
        1,
    ),
    data: b"",
    log: "nullifier a4b62586c9d5fa92eca28133e7e9bb42e5d370bd28e775e18f38ee5042c91d0c was already spent in NoteSource::Transaction(1d56ca53120eba5dff132c45aff4e35543af90c0826b6061480233091acaa6a6)",
    hash: Hash::Sha256(C550D2FD93F3E77508D261DCF214BF557019B49C8DF04389A5074D5B519BB32B),
}

You can grab the 11MB flat file of all the logs from fn0 here: https://gist.github.com/conorsch/abe6bb67d13ba33b5201691614bb466d To obtain the full dump of logs from fn0, I ran kubectl get pods, then copy/pasted the fn-0 pod name into the command: kubectl logs --all-containers penumbra-testnet-fn-0-nvrbx > logs-all.txt.

hdevalence commented 1 year ago

We determined that the cause was a single validator achieving a >2/3 supermajority of stake and then not producing blocks.