strangelove-ventures / cosmos-operator

Cosmos Operator is a Kubernetes operator for managing Cosmos nodes
Apache License 2.0

Pod should auto-restart when encountering cometbft bug. #409

Open danbryan opened 7 months ago

danbryan commented 7 months ago

Pods periodically stop syncing blocks when they encounter the CometBFT bug. Let's try to identify a way to detect when this has occurred and auto-restart the pod. It could be as simple as no response from the status endpoint for 2 minutes.

### Tasks
- [ ] document the bug (ie, file an issue on `CometBFT` repo)
- [ ] set up our auto restart
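
A minimal sketch of the "no response from the status endpoint for 2 mins" idea, for discussion only. It assumes the node's CometBFT RPC `/status` endpoint is reachable from wherever this runs (for example `http://localhost:26657` after a port-forward). The function names `is_stalled`, `probe_status`, and `watch_node`, the 5-second probe timeout, and the 10-second poll interval are all hypothetical choices, not something the operator does today.

```shell
#!/bin/bash

# Hypothetical helper: succeeds (exit 0) when the node should be treated as stalled.
#   $1 = epoch seconds of the last successful /status response
#   $2 = current epoch seconds
#   $3 = threshold in seconds (defaults to 120, i.e. the 2 minutes proposed above)
is_stalled() {
    local last_ok="$1" now="$2" threshold="${3:-120}"
    (( now - last_ok >= threshold ))
}

# Hypothetical probe: succeeds when the RPC status endpoint answers within 5 seconds.
#   $1 = RPC base URL (assumed to be the node's CometBFT RPC address)
probe_status() {
    curl --silent --fail --max-time 5 "${1}/status" > /dev/null
}

# Hypothetical watch loop: delete the pod once /status has been silent for 2 minutes.
#   $1 = RPC base URL, $2 = namespace, $3 = pod name
watch_node() {
    local rpc="$1" ns="$2" pod="$3" last_ok
    last_ok=$(date +%s)
    while true; do
        if probe_status "$rpc"; then
            last_ok=$(date +%s)
        elif is_stalled "$last_ok" "$(date +%s)"; then
            echo "ns: $ns pod: $pod is stuck"
            kubectl delete --wait=false pod -n "$ns" "$pod"
            last_ok=$(date +%s)   # give the replacement pod a fresh 2-minute window
        fi
        sleep 10
    done
}
```

Nothing runs until `watch_node` is called, e.g. `watch_node http://localhost:26657 <namespace> <pod>`. Keeping the threshold check in `is_stalled` separate from the probe makes the timing logic easy to test without a cluster.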

danbryan commented 6 months ago

@vimystic can you provide an update on this?

vimystic commented 6 months ago

Is there a description of the CometBFT bug itself somewhere?

danbryan commented 6 months ago

@agouin are you able to describe or link to the bug?

@vimystic here is a script that identifies and restarts pods that are impacted by this bug.

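#!/bin/bash
# Restart sentry pods whose recent logs suggest they have stopped syncing.

kubectl config use-context sentry-mainnet@sl-colo
# Collect "namespace pod" pairs for every cosmos-sentry pod, flattened into one array.
PODS=( $(kubectl get pods -A --no-headers | awk '/cosmos-sentry/ {print $1,$2}') )

for (( i=0; i<${#PODS[@]}; i+=2 )); do
    ns="${PODS[i]}"
    pod="${PODS[i+1]}"
    # The script treats "SignerListener: Connected" in the last 30 log lines as the sign of a stuck pod.
    if kubectl logs -c node --tail=30 -n "$ns" "$pod" 2>/dev/null | grep -q "SignerListener: Connected"; then
      echo "ns: $ns pod: $pod is stuck"
      kubectl delete --wait=false pod -n "$ns" "$pod"
    fi
done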

jonathanpberger commented 6 months ago

Depends on the kubectl config secret.

vimystic commented 5 months ago

Blocked until https://github.com/strangelove-ventures/infra/issues/3020 is completed.

jonathanpberger commented 4 months ago

https://github.com/strangelove-ventures/infra/issues/3020 is complete! Unblocking.