Closed: conorsch closed this issue 1 year ago.
On a clean tendermint init, we see the keyfiles and config:
tendermint init --home /tmp/sandbox-tm
I[2023-02-06|09:28:10.009] Generated private validator module=main keyFile=/tmp/sandbox-tm/config/priv_validator_key.json stateFile=/tmp/sandbox-tm/data/priv_validator_state.json
I[2023-02-06|09:28:10.009] Generated node key module=main path=/tmp/sandbox-tm/config/node_key.json
I[2023-02-06|09:28:10.009] Generated genesis file module=main path=/tmp/sandbox-tm/config/genesis.json
❯ tree /tmp/sandbox-tm
/tmp/sandbox-tm
├── config
│ ├── config.toml
│ ├── genesis.json
│ ├── node_key.json
│ └── priv_validator_key.json
└── data
└── priv_validator_state.json
3 directories, 5 files
As soon as tendermint starts, it creates config/addrbook.json. Let's check for the existence of that file and skip initialization if we find it: if the address book exists, Tendermint has clearly started at least once before, so we should not touch its state. Gating on data/state.db instead might be just as reasonable.
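As a sketch, the guard could look like the following. The helper names are my own for illustration; the addrbook.json check and the tendermint init validator invocation mirror the initContainer traces later in this issue.

```shell
#!/bin/sh
# Hypothetical sketch of the idempotency guard; helper names are
# illustrative, but the addrbook.json check and the init command
# mirror the initContainer traces in this issue.
set -eu

# True if Tendermint has started here before: addrbook.json only
# appears once the node has actually run, never from `tendermint init`.
already_started() {
    [ -e "$1/config/addrbook.json" ]
}

# Generate keys, genesis, and config only on a truly fresh volume.
maybe_init() {
    chain_dir="$1"
    if already_started "$chain_dir"; then
        echo "Address book already exists, not initializing..."
        return 0
    fi
    tendermint init validator --home "$chain_dir"
}
```

On a restarted pod the guard fires and the previously generated priv_validator_key.json is left untouched; on a brand-new persistent volume the init path runs exactly once.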
That seems to work well. To test, I deployed https://github.com/penumbra-zone/penumbra/commit/f0775c31e31439f0e804d0cf27908c3e8b8e0032 to preview, then pulled logs from the first validator in the preview deployment:
$ kubectl logs --since=1h penumbra-testnet-preview-val-0-xg2ts --all-containers > val-0-before.log
$ head -n 15 val-0-before.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' '!' -d /home/.tendermint ]
+ tendermint init validator --home /home/.tendermint
I[2023-02-06|18:06:45.016] Generated private validator module=main keyFile=/home/.tendermint/config/priv_validator_key.json stateFile=/home/.tendermint/data/priv_validator_state.json
I[2023-02-06|18:06:45.017] Generated node key module=main path=/home/.tendermint/config/node_key.json
I[2023-02-06|18:06:45.017] Generated genesis file module=main path=/home/.tendermint/config/genesis.json
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' -d /home/tmpConfig/config ]
There we can see the key-init logic running. Then I killed the pod for the first validator via kubectl delete pod penumbra-testnet-preview-val-0-xg2ts. The ReplicationController automatically created a replacement, visible as the youngest of the set here:
$ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet-preview
NAME READY STATUS RESTARTS AGE
penumbra-testnet-preview-fn-0-x9sw2 3/3 Running 0 12m
penumbra-testnet-preview-fn-1-5qxx8 3/3 Running 0 12m
penumbra-testnet-preview-val-0-dltcz 2/2 Running 0 61s
penumbra-testnet-preview-val-1-8cs9g 2/2 Running 0 12m
Let's grab those logs and inspect:
$ kubectl logs --since=1h penumbra-testnet-preview-val-0-dltcz --all-containers > val-0-after.log
$ head -n 15 val-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
Address book already exists, not merging configs...
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0
Just what we want: the new validator instance comes up using the same config in the previously created persistent volume.
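The config-merge step can be guarded the same way. A sketch, assuming the directory layout from the traces above; the merge itself isn't shown in the logs, so the cp lines below are placeholders, not the real merge logic:

```shell
#!/bin/sh
# Hypothetical sketch of the guarded config-merge step. The directory
# roles come from the traces above; the cp-based merge is a placeholder.
set -eu

maybe_merge() {
    config_dir="$1"   # e.g. /home/.tendermint/config
    overlay_dir="$2"  # e.g. /config, mounted from a ConfigMap
    merge_dir="$3"    # e.g. /tmp/configMerge
    if [ -e "$config_dir/addrbook.json" ]; then
        echo "Address book already exists, not merging configs..."
        return 0
    fi
    mkdir -p "$merge_dir"
    # Placeholder merge: overlay files win over the generated defaults.
    cp -r "$config_dir/." "$merge_dir/"
    cp -r "$overlay_dir/." "$merge_dir/"
}
```

Both guards key off the same file, so a restarted pod skips both steps, exactly as the val-0 and fn-0 logs show.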
And here's the same, but for a fullnode in the deployment:
$ head -n 15 fn-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-fn-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-fn-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
Address book already exists, not initializing...
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0
Address book already exists, not merging configs...
It was worth checking separately, since technically the fullnode and validator configs use different init logic.
This happened again on testnet 044-ananke. We can see that the pods were destroyed and recreated ~11h ago:
❯ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet
NAME READY STATUS RESTARTS AGE
penumbra-testnet-fn-0-ct9pb 3/3 Running 0 11h
penumbra-testnet-fn-1-7gvm5 3/3 Running 0 11h
penumbra-testnet-val-0-rf24g 2/2 Running 0 11h
penumbra-testnet-val-1-rrzp4 2/2 Running 0 11h
And this matches the lifetime of the nodes on which those pods are running:
❯ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-testnet-chain-node-pool-582c2542-q3rn Ready <none> 11h v1.25.5-gke.2000
gke-testnet-chain-node-pool-8f5ab500-ubvc Ready <none> 11h v1.25.5-gke.2000
The don't-reinitialize logic described above was triggered:
❯ kubectl logs penumbra-testnet-fn-0-ct9pb -c config-init
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0
That's good, but clearly not enough to keep the testnet functioning. From a node on the testnet:
Feb 18 17:56:31 shadow tendermint[715337]: E[2023-02-18|17:56:31.175] prevote step: ProposalBlock is invalid module=consensus height=63272 round=281 err="wrong Block.Header.AppHash. Expected DFA44D9E49CB9A07B8A6AC1A227B7212A5BF94A48E4CBA271518E3FE56E026CE, got 9232347076BF2BCF833688502A16220A97608D40BCEFEA5C94BC2201F84A4C9D"
We disabled automatic upgrades to the node pool in f2d98df9833d29e17e762b174cca7d6e722e0b68, to minimize surprises, and filed #2011 to increase headroom on storage requests.
Over the weekend we saw a failure of Testnet 42 Adraste (#1877). After investigation, it appears that an automatic node pool upgrade destroyed the deployment at around 2023-02-05T05:45+00:00. Ostensibly this happened because we've set the cluster's node pool options to auto_upgrade=true, here: https://github.com/penumbra-zone/penumbra/blob/a0a6a5ca4e983886d9e058eef384c31a62bb0e2a/deployments/terraform/modules/node/v1/gke.tf#L47 but the root cause is that our initContainer logic for Tendermint keygen isn't idempotent. Let's update the latter so that we can safely restart a service from the same persistent volume and have things work just fine.