Open danbryan opened 7 months ago
@vimystic can you provide an update on this?
Is there a description of the cometbft bug itself somewhere ?
@agouin are you able to describe or link to the bug?
@vimystic here is a script that identifies and restarts pods that are impacted by this bug.
#!/bin/bash
kubectl config use-context sentry-mainnet@sl-colo
PODS=( $(kubectl get pods -A | grep cosmos-sentry | awk '{print $1,$2}') )
for (( i=0; i<${#PODS[@]} ; i+=2 )) ; do
ns="${PODS[i]}"
pod="${PODS[i+1]}"
kubectl logs -c node --tail=30 -n $ns $pod | grep "SignerListener: Connected" 2>&1 > /dev/null
if [[ "$?" == "0" ]]; then
echo "ns: ${PODS[i]} pod: ${PODS[i+1]} is stuck"
kubectl delete --wait=false pod -n $ns $pod
fi
done
depends on kubectl config
secret.
Blocked until https://github.com/strangelove-ventures/infra/issues/3020 is completed.
https://github.com/strangelove-ventures/infra/issues/3020 is complete! Unblocking.
Pods stop syncing blocks periodically when they encounter th cometbft bug. Lets try to identify a way to know when this occurred, and auto restart the pod. Could be as simple as no response from the status endpoint for 2 mins.