penumbra-zone / penumbra

Penumbra is a fully private proof-of-stake network and decentralized exchange for the Cosmos ecosystem.
https://penumbra.zone
Apache License 2.0

Service restarts due to upgrades can destroy testnet deployment #1952

Closed · conorsch closed 1 year ago

conorsch commented 1 year ago

Over the weekend we saw a failure of Testnet 42 Adraste (#1877). After investigation, it appears that an automatic node pool upgrade destroyed the deployment at around 2023-02-05T05:45+00:00:

[image: gke-auto-upgrade console event]

Ostensibly this happened because the cluster config sets auto_upgrade=true on the node pool, here: https://github.com/penumbra-zone/penumbra/blob/a0a6a5ca4e983886d9e058eef384c31a62bb0e2a/deployments/terraform/modules/node/v1/gke.tf#L47. The root cause, however, is that our initContainer logic for Tendermint keygen isn't idempotent. Let's fix that logic so a service can safely restart from the same persistent volume and come back up cleanly.

conorsch commented 1 year ago

On a clean tendermint init, we see the keyfiles and config:

❯ tendermint init --home /tmp/sandbox-tm
I[2023-02-06|09:28:10.009] Generated private validator                  module=main keyFile=/tmp/sandbox-tm/config/priv_validator_key.json stateFile=/tmp/sandbox-tm/data/priv_validator_state.json
I[2023-02-06|09:28:10.009] Generated node key                           module=main path=/tmp/sandbox-tm/config/node_key.json
I[2023-02-06|09:28:10.009] Generated genesis file                       module=main path=/tmp/sandbox-tm/config/genesis.json

❯ tree /tmp/sandbox-tm
/tmp/sandbox-tm
├── config
│   ├── config.toml
│   ├── genesis.json
│   ├── node_key.json
│   └── priv_validator_key.json
└── data
    └── priv_validator_state.json

3 directories, 5 files

As soon as Tendermint starts, it creates config/addrbook.json. Let's check for that file and skip initialization when it exists: its presence means Tendermint has started at least once before, so we should not touch its state. Gating on data/state.db instead would be about as reasonable. A sketch of the guard is below.
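A minimal sketch of such a guard, assuming a POSIX sh initContainer script (variable names and paths match the deployed trace in the next comment; the config-merge handling is elided):

#!/bin/sh
set -eux

CHAIN_DIR=/home/.tendermint

# If the address book exists, Tendermint has already run at least once
# from this volume, so leave its keys and state alone.
if [ -e "$CHAIN_DIR/config/addrbook.json" ]; then
  echo "Address book already exists, not initializing..."
  exit 0
fi

tendermint init validator --home "$CHAIN_DIR"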

conorsch commented 1 year ago

That seems to work well. To test, I deployed https://github.com/penumbra-zone/penumbra/commit/f0775c31e31439f0e804d0cf27908c3e8b8e0032 to preview, then pulled logs from the first validator in the preview deployment:

$ kubectl logs  --since=1h penumbra-testnet-preview-val-0-xg2ts --all-containers > val-0-before.log
$ head -n 15 val-0-before.log 
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' '!' -d /home/.tendermint ]
+ tendermint init validator --home /home/.tendermint
I[2023-02-06|18:06:45.016] Generated private validator                  module=main keyFile=/home/.tendermint/config/priv_validator_key.json stateFile=/home/.tendermint/data/priv_validator_state.json
I[2023-02-06|18:06:45.017] Generated node key                           module=main path=/home/.tendermint/config/node_key.json
I[2023-02-06|18:06:45.017] Generated genesis file                       module=main path=/home/.tendermint/config/genesis.json
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' -d /home/tmpConfig/config ]

There we can see the key-init logic running. Then I killed the pod for the first validator via kubectl delete pod penumbra-testnet-preview-val-0-xg2ts. The ReplicationController automatically created a replacement, visible as the youngest pod in the set:

$ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet-preview                
NAME                                   READY   STATUS    RESTARTS   AGE
penumbra-testnet-preview-fn-0-x9sw2    3/3     Running   0          12m
penumbra-testnet-preview-fn-1-5qxx8    3/3     Running   0          12m                  
penumbra-testnet-preview-val-0-dltcz   2/2     Running   0          61s                                                                                                                                                         
penumbra-testnet-preview-val-1-8cs9g   2/2     Running   0          12m

Let's grab those logs and inspect:

$ kubectl logs  --since=1h penumbra-testnet-preview-val-0-dltcz --all-containers > val-0-after.log
$ head -n 15 val-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
Address book already exists, not merging configs...
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0

Just what we want: the new validator instance comes up using the config previously created on its persistent volume.
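The same manual check could be scripted as a rough smoke test (a sketch only: the pod name and label selector come from the output above, and the config-init container name is visible in a later comment):

# Delete the validator pod, wait for the replacement to come up, then
# confirm the init container skipped keygen on the reused volume.
kubectl delete pod penumbra-testnet-preview-val-0-xg2ts
kubectl wait --for=condition=Ready pod \
  -l app.kubernetes.io/instance=penumbra-testnet-preview --timeout=5m
kubectl logs -l app.kubernetes.io/instance=penumbra-testnet-preview \
  -c config-init | grep 'not initializing'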

conorsch commented 1 year ago

And here's the same, but for a fullnode in the deployment:

$ head -n 15 fn-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-fn-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-fn-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
Address book already exists, not initializing...
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0
Address book already exists, not merging configs...

It was worth checking separately, since technically the fullnode and validator configs use different init logic.

conorsch commented 1 year ago

This happened again on testnet 044-ananke. We can see that the pods were destroyed and recreated ~11h ago:

❯ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet
NAME                           READY   STATUS    RESTARTS   AGE
penumbra-testnet-fn-0-ct9pb    3/3     Running   0          11h
penumbra-testnet-fn-1-7gvm5    3/3     Running   0          11h
penumbra-testnet-val-0-rf24g   2/2     Running   0          11h
penumbra-testnet-val-1-rrzp4   2/2     Running   0          11h

And this matches the lifetime of the nodes on which those pods are running:

❯ kubectl get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
gke-testnet-chain-node-pool-582c2542-q3rn   Ready    <none>   11h   v1.25.5-gke.2000
gke-testnet-chain-node-pool-8f5ab500-ubvc   Ready    <none>   11h   v1.25.5-gke.2000
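The pod-to-node mapping can also be confirmed directly, since -o wide adds a NODE column:

kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet -o wide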

The don't-reinitialize logic described above was triggered:

❯ kubectl logs penumbra-testnet-fn-0-ct9pb  -c config-init
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0

Which is good, but clearly not enough to keep the testnet functioning. From a node on the testnet:

Feb 18 17:56:31 shadow tendermint[715337]: E[2023-02-18|17:56:31.175] prevote step: ProposalBlock is invalid       module=consensus height=63272 round=281 err="wrong Block.Header.AppHash.  Expected DFA44D9E49CB9A07B8A6AC1A227B7212A5BF94A48E4CBA271518E3FE56E026CE, got 9232347076BF2BCF833688502A16220A97608D40BCEFEA5C94BC2201F84A4C9D"  
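The node is rejecting proposed blocks because its application state hash disagrees with the network's, so preserving keys across restarts is necessary but not sufficient. A quick way to surface this failure mode (assuming tendermint runs as a systemd unit, as in the log line above):

journalctl -u tendermint --since today | grep 'wrong Block.Header.AppHash'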

conorsch commented 1 year ago

We disabled automatic upgrades for the node pool in f2d98df9833d29e17e762b174cca7d6e722e0b68 to minimize surprises, and filed #2011 to increase headroom on storage requests.
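For reference, the equivalent imperative change via gcloud (illustrative only; the actual fix lives in the Terraform module, and POOL/CLUSTER are placeholders):

gcloud container node-pools update POOL --cluster CLUSTER --no-enable-autoupgrade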