strangelove-ventures / cosmos-operator

Cosmos Operator is a Kubernetes operator for managing Cosmos nodes
Apache License 2.0

CosmosFullNode: Restart pods that have `ContainerStatusUnknown` #185

Closed: DavidNix closed this issue 1 year ago

DavidNix commented 1 year ago

I've seen this a few times now. Restarting the pod fixes the issue. I'm not sure of the root cause.

DavidNix commented 1 year ago

Finally remembered to copy one.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    app.kubernetes.io/ordinal: "0"
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  creationTimestamp: "2023-01-30T18:04:30Z"
  labels:
    app.kubernetes.io/component: CosmosFullNode
    app.kubernetes.io/created-by: cosmos-operator
    app.kubernetes.io/instance: juno-mainnet-fullnode-0
    app.kubernetes.io/name: juno-mainnet-fullnode
    app.kubernetes.io/revision: 4f3ef332
    app.kubernetes.io/version: v11.0.0
    cosmos.strange.love/network: mainnet
  name: juno-mainnet-fullnode-0
  namespace: strangelove
  ownerReferences:
  - apiVersion: cosmos.strange.love/v1
    blockOwnerDeletion: true
    controller: true
    kind: CosmosFullNode
    name: juno-mainnet-fullnode
    uid: aa1fd035-e04e-47a6-93ba-8ab1c4b83801
  resourceVersion: "102236387"
  uid: 9681c76a-8907-4029-85ce-92fc8f1ed08f
spec:
  containers:
  - args:
    - start
    - --home
    - /home/operator/cosmos
    - --x-crisis-skip-assert-invariants
    command:
    - junod
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imagePullPolicy: IfNotPresent
    name: node
    ports:
    - containerPort: 1317
      name: api
      protocol: TCP
    - containerPort: 8080
      name: rosetta
      protocol: TCP
    - containerPort: 9090
      name: grpc
      protocol: TCP
    - containerPort: 26660
      name: prometheus
      protocol: TCP
    - containerPort: 26656
      name: p2p
      protocol: TCP
    - containerPort: 26657
      name: rpc
      protocol: TCP
    - containerPort: 9091
      name: grpc-web
      protocol: TCP
    readinessProbe:
      failureThreshold: 5
      httpGet:
        path: /health
        port: 26657
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: "1"
        memory: 12Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qklwb
      readOnly: true
    workingDir: /home/operator
  - command:
    - ihc
    image: ghcr.io/strangelove-ventures/ignite-health-check:v0.0.1
    imagePullPolicy: IfNotPresent
    name: healthcheck
    ports:
    - containerPort: 1251
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 1251
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: 5m
        memory: 16Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qklwb
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - -c
    - "\nset -eu\nif [ ! -d \"$CHAIN_HOME/data\" ]; then\n\techo \"Initializing chain...\"\n\tjunod
      init juno-mainnet-fullnode-0 --chain-id juno-1 --home \"$CHAIN_HOME\"\n\t# Remove
      because downstream containers check the presence of this file.\n\trm \"$GENESIS_FILE\"\nelse\n\techo
      \"Skipping chain init; already initialized.\"\nfi\n\necho \"Initializing into
      tmp dir for downstream processing...\"\njunod init juno-mainnet-fullnode-0 --chain-id
      juno-1 --home \"$HOME/.tmp\"\n"
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imagePullPolicy: IfNotPresent
    name: chain-init
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qklwb
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - "if [ -f \"$GENESIS_FILE\" ]; then\n\techo \"Genesis file $GENESIS_FILE already
      exists; skipping initialization.\"\n\texit 0\nfi\n\nset -eu\n\n# $GENESIS_FILE
      and $CONFIG_DIR already set via pod env vars.\n\nGENESIS_URL=\"$1\"\n\necho
      \"Downloading genesis file $GENESIS_URL to $GENESIS_FILE...\"\n\ndownload_json()
      {\n  echo \"Downloading plain json...\"\n  wget -c -O \"$GENESIS_FILE\" \"$GENESIS_URL\"\n}\n\ndownload_jsongz()
      {\n  echo \"Downloading json.gz...\"\n  wget -c -O - \"$GENESIS_URL\" | gunzip
      -c > \"$GENESIS_FILE\"\n}\n\ndownload_tar() {\n  echo \"Downloading and extracting
      tar...\"\n  wget -c -O - \"$GENESIS_URL\" | tar -x -C \"$CONFIG_DIR\"\n}\n\ndownload_targz()
      {\n  echo \"Downloading and extracting compressed tar...\"\n  wget -c -O - \"$GENESIS_URL\"
      | tar -xz -C \"$CONFIG_DIR\"\n}\n\ndownload_zip() {\n  echo \"Downloading and
      extracting zip...\"\n  wget -c -O tmp_genesis.zip \"$GENESIS_URL\"\n  unzip
      tmp_genesis.zip\n  rm tmp_genesis.zip\n  mv genesis.json \"$GENESIS_FILE\"\n}\n\nrm
      -f \"$GENESIS_FILE\"\n\ncase \"$GENESIS_URL\" in\n  *.json.gz) download_jsongz
      ;;\n  *.json) download_json ;;\n  *.tar.gz) download_targz ;;\n  *.tar.gzip)
      download_targz ;;\n  *.tar) download_tar ;;\n  *.zip) download_zip ;;\n  *)
      echo \"Unable to handle file extension for $GENESIS_URL\"; exit 1 ;;\nesac\n\necho
      \"Saved genesis file to $GENESIS_FILE.\"\necho \"Download genesis file complete.\"\n\necho
      \"Genesis $GENESIS_FILE initialized.\"\n"
    - -s
    - https://download.dimi.sh/juno-phoenix2-genesis.tar.gz
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: genesis-init
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qklwb
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - |2

      set -eu
      CONFIG_DIR="$CHAIN_HOME/config"
      TMP_DIR="$HOME/.tmp/config"
      OVERLAY_DIR="$HOME/.config"
      echo "Merging config..."
      set -x
      config-merge -f toml "$TMP_DIR/config.toml" "$OVERLAY_DIR/config-overlay.toml" > "$CONFIG_DIR/config.toml"
      config-merge -f toml "$TMP_DIR/app.toml" "$OVERLAY_DIR/app-overlay.toml" > "$CONFIG_DIR/app.toml"
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: config-merge
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qklwb
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - "set -eu\nif test -n \"$(find $DATA_DIR -maxdepth 1 -name '*.db' -print -quit)\";
      then\n\techo \"Databases in $DATA_DIR already exists; skipping initialization.\"\n\texit
      0\nfi\n\nset -eu\n\n# $CHAIN_HOME already set via pod env vars.\n\nSNAPSHOT_URL=\"$1\"\n\necho
      \"Downloading snapshot archive $SNAPSHOT_URL to $CHAIN_HOME...\"\n\ndownload_tar()
      {\n  echo \"Downloading and extracting tar...\"\n  wget -c -O - \"$SNAPSHOT_URL\"
      | tar -x -C \"$CHAIN_HOME\"\n}\n\ndownload_targz() {\n  echo \"Downloading and
      extracting compressed tar...\"\n  wget -c -O - \"$SNAPSHOT_URL\" | tar -xz -C
      \"$CHAIN_HOME\"\n}\n\ndownload_lz4() {\n  echo \"Downloading and extracting
      lz4...\"\n  wget -c -O - \"$SNAPSHOT_URL\" | lz4 -c -d | tar -x -C \"$CHAIN_HOME\"\n}\n\ncase
      \"$SNAPSHOT_URL\" in\n  *.tar.lz4) download_lz4 ;;\n  *.tar.gzip) download_targz
      ;;\n  *.tar.gz) download_targz ;;\n  *.tar) download_tar ;;\n  *) echo \"Unable
      to handle file extension for $SNAPSHOT_URL\"; exit 1 ;;\nesac\n\necho \"Download
      and extract snapshot complete.\"\n\necho \"$DATA_DIR initialized.\"\n"
    - -s
    - https://snapshots.polkachu.com/snapshots/juno/juno_5373433.tar.lz4
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: snapshot-restore
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qklwb
      readOnly: true
    workingDir: /home/operator
  nodeName: gke-juno-mainnet-full-chain-node-pool-b2985e35-w530
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  readinessGates:
  - conditionType: cloud.google.com/load-balancer-neg-ready
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1025
    fsGroupChangePolicy: OnRootMismatch
    runAsGroup: 1025
    runAsNonRoot: true
    runAsUser: 1025
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: vol-chain-home
    persistentVolumeClaim:
      claimName: pvc-juno-mainnet-fullnode-0
  - emptyDir: {}
    name: vol-tmp
  - configMap:
      defaultMode: 420
      items:
      - key: config-overlay.toml
        path: config-overlay.toml
      - key: app-overlay.toml
        path: app-overlay.toml
      name: juno-mainnet-fullnode-0
    name: vol-config
  - name: kube-api-access-qklwb
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'Pod has become Healthy in NEG "Key{\"k8s1-d906fa4e-strangelove-juno-mainnet-fullnode-r-2665-e6f07b34\",
      zone: \"us-east1-d\"}" attached to BackendService "Key{\"k8s1-d906fa4e-strangelove-juno-mainnet-fullnode-r-2665-e6f07b34\"}".
      Marking condition "cloud.google.com/load-balancer-neg-ready" to True.'
    reason: LoadBalancerNegReady
    status: "True"
    type: cloud.google.com/load-balancer-neg-ready
  - lastProbeTime: null
    lastTransitionTime: "2023-01-30T18:04:40Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-01-30T22:34:27Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-01-30T22:34:27Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-01-30T18:04:30Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: ghcr.io/strangelove-ventures/ignite-health-check:v0.0.1
    imageID: ""
    lastState:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was deleted.  The
          container used to be Running
        reason: ContainerStatusUnknown
        startedAt: null
    name: healthcheck
    ready: false
    restartCount: 1
    started: false
    state:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        startedAt: null
  - containerID: containerd://5508864cbe02118c152283f3755a4f0bf6dfa23e4b1d46be0994b28d743a6bf1
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imageID: ghcr.io/strangelove-ventures/heighliner/juno@sha256:f65390b4383bdde4ae37e9b712e42b595dea98cdf9b1622450cd564c1b544ebf
    lastState: {}
    name: node
    ready: false
    restartCount: 1
    started: false
    state:
      terminated:
        containerID: containerd://5508864cbe02118c152283f3755a4f0bf6dfa23e4b1d46be0994b28d743a6bf1
        exitCode: 137
        finishedAt: "2023-01-30T22:34:27Z"
        reason: OOMKilled
        startedAt: "2023-01-30T19:47:18Z"
  hostIP: 192.168.5.3
  initContainerStatuses:
  - containerID: containerd://f5a895b64fe59e598cdaed578782dc2edfe214d3b0e78efd80f57c998bc56829
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imageID: ghcr.io/strangelove-ventures/heighliner/juno@sha256:f65390b4383bdde4ae37e9b712e42b595dea98cdf9b1622450cd564c1b544ebf
    lastState: {}
    name: chain-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://f5a895b64fe59e598cdaed578782dc2edfe214d3b0e78efd80f57c998bc56829
        exitCode: 0
        finishedAt: "2023-01-30T18:04:36Z"
        reason: Completed
        startedAt: "2023-01-30T18:04:36Z"
  - containerID: containerd://5f001f9bae66df709d99ced2f05a6e009823c70d2dc0d64388e300e8026ddc14
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: genesis-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://5f001f9bae66df709d99ced2f05a6e009823c70d2dc0d64388e300e8026ddc14
        exitCode: 0
        finishedAt: "2023-01-30T18:04:37Z"
        reason: Completed
        startedAt: "2023-01-30T18:04:37Z"
  - containerID: containerd://a46fa8e1651059d56cab72ccaab800dbe356f2666767610a8d8154180e3f348e
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: config-merge
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://a46fa8e1651059d56cab72ccaab800dbe356f2666767610a8d8154180e3f348e
        exitCode: 0
        finishedAt: "2023-01-30T18:04:39Z"
        reason: Completed
        startedAt: "2023-01-30T18:04:38Z"
  - containerID: containerd://dc8fac97f26ae0f039f921d037d280051ef5f6ad4771ff051d613b4069166ee8
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: snapshot-restore
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://dc8fac97f26ae0f039f921d037d280051ef5f6ad4771ff051d613b4069166ee8
        exitCode: 0
        finishedAt: "2023-01-30T18:04:39Z"
        reason: Completed
        startedAt: "2023-01-30T18:04:39Z"
  message: 'The node was low on resource: memory. Container node was using 35505620Ki,
    which exceeds its request of 12Gi. '
  phase: Failed
  podIP: 10.7.0.103
  podIPs:
  - ip: 10.7.0.103
  qosClass: Burstable
  reason: Evicted
  startTime: "2023-01-30T18:04:30Z"
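
For reference, the signals in the dump above are `phase: Failed`, `reason: Evicted`, and a container whose `state.terminated.reason` is `ContainerStatusUnknown`. Below is a minimal sketch of how such a pod could be detected, written in Python against a plain dict shaped like the pod status. The function names and the restart policy are assumptions for illustration only and are not the operator's actual implementation.

```python
# Sketch only: detect pods that look like the stuck pod in this issue.
# The dict layout mirrors the Pod status fields in the dump above;
# the function names and the policy are hypothetical.

def has_unknown_container(pod: dict) -> bool:
    """Return True if any container terminated with reason ContainerStatusUnknown."""
    status = pod.get("status") or {}
    for cs in status.get("containerStatuses") or []:
        terminated = (cs.get("state") or {}).get("terminated") or {}
        if terminated.get("reason") == "ContainerStatusUnknown":
            return True
    return False


def should_restart(pod: dict) -> bool:
    """Assumed policy: restart a pod that is Failed and has an unknown container."""
    status = pod.get("status") or {}
    return status.get("phase") == "Failed" and has_unknown_container(pod)


# Trimmed copy of the status section from the dump above.
EVICTED_POD = {
    "status": {
        "phase": "Failed",
        "reason": "Evicted",
        "containerStatuses": [
            {
                "name": "healthcheck",
                "state": {
                    "terminated": {"exitCode": 137, "reason": "ContainerStatusUnknown"}
                },
            },
            {
                "name": "node",
                "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            },
        ],
    }
}

# A pod whose containers are all running, for contrast.
HEALTHY_POD = {
    "status": {
        "phase": "Running",
        "containerStatuses": [{"name": "node", "state": {"running": {}}}],
    }
}

if __name__ == "__main__":
    print(should_restart(EVICTED_POD))
    print(should_restart(HEALTHY_POD))
```

Checking the pod phase in addition to the container reason is a design choice in this sketch: it avoids acting on a pod that is still running but briefly reports an unknown container. Whether that extra condition is wanted is an open question here.
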
DavidNix commented 1 year ago

Another example, just in case there are differences.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    app.kubernetes.io/ordinal: "5"
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  creationTimestamp: "2023-01-31T21:03:10Z"
  labels:
    app.kubernetes.io/component: CosmosFullNode
    app.kubernetes.io/created-by: cosmos-operator
    app.kubernetes.io/instance: juno-mainnet-fullnode-5
    app.kubernetes.io/name: juno-mainnet-fullnode
    app.kubernetes.io/revision: 4f3ef332
    app.kubernetes.io/version: v11.0.0
    cosmos.strange.love/network: mainnet
  name: juno-mainnet-fullnode-5
  namespace: strangelove
  ownerReferences:
  - apiVersion: cosmos.strange.love/v1
    blockOwnerDeletion: true
    controller: true
    kind: CosmosFullNode
    name: juno-mainnet-fullnode
    uid: aa1fd035-e04e-47a6-93ba-8ab1c4b83801
  resourceVersion: "105075022"
  uid: 6fd95328-264d-4730-b47c-3bf808e89850
spec:
  containers:
  - args:
    - start
    - --home
    - /home/operator/cosmos
    - --x-crisis-skip-assert-invariants
    command:
    - junod
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imagePullPolicy: IfNotPresent
    name: node
    ports:
    - containerPort: 1317
      name: api
      protocol: TCP
    - containerPort: 8080
      name: rosetta
      protocol: TCP
    - containerPort: 9090
      name: grpc
      protocol: TCP
    - containerPort: 26660
      name: prometheus
      protocol: TCP
    - containerPort: 26656
      name: p2p
      protocol: TCP
    - containerPort: 26657
      name: rpc
      protocol: TCP
    - containerPort: 9091
      name: grpc-web
      protocol: TCP
    readinessProbe:
      failureThreshold: 5
      httpGet:
        path: /health
        port: 26657
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: "1"
        memory: 12Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shnpw
      readOnly: true
    workingDir: /home/operator
  - command:
    - ihc
    image: ghcr.io/strangelove-ventures/ignite-health-check:v0.0.1
    imagePullPolicy: IfNotPresent
    name: healthcheck
    ports:
    - containerPort: 1251
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 1251
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: 5m
        memory: 16Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shnpw
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - -c
    - "\nset -eu\nif [ ! -d \"$CHAIN_HOME/data\" ]; then\n\techo \"Initializing chain...\"\n\tjunod
      init juno-mainnet-fullnode-5 --chain-id juno-1 --home \"$CHAIN_HOME\"\n\t# Remove
      because downstream containers check the presence of this file.\n\trm \"$GENESIS_FILE\"\nelse\n\techo
      \"Skipping chain init; already initialized.\"\nfi\n\necho \"Initializing into
      tmp dir for downstream processing...\"\njunod init juno-mainnet-fullnode-5 --chain-id
      juno-1 --home \"$HOME/.tmp\"\n"
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imagePullPolicy: IfNotPresent
    name: chain-init
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shnpw
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - "if [ -f \"$GENESIS_FILE\" ]; then\n\techo \"Genesis file $GENESIS_FILE already
      exists; skipping initialization.\"\n\texit 0\nfi\n\nset -eu\n\n# $GENESIS_FILE
      and $CONFIG_DIR already set via pod env vars.\n\nGENESIS_URL=\"$1\"\n\necho
      \"Downloading genesis file $GENESIS_URL to $GENESIS_FILE...\"\n\ndownload_json()
      {\n  echo \"Downloading plain json...\"\n  wget -c -O \"$GENESIS_FILE\" \"$GENESIS_URL\"\n}\n\ndownload_jsongz()
      {\n  echo \"Downloading json.gz...\"\n  wget -c -O - \"$GENESIS_URL\" | gunzip
      -c > \"$GENESIS_FILE\"\n}\n\ndownload_tar() {\n  echo \"Downloading and extracting
      tar...\"\n  wget -c -O - \"$GENESIS_URL\" | tar -x -C \"$CONFIG_DIR\"\n}\n\ndownload_targz()
      {\n  echo \"Downloading and extracting compressed tar...\"\n  wget -c -O - \"$GENESIS_URL\"
      | tar -xz -C \"$CONFIG_DIR\"\n}\n\ndownload_zip() {\n  echo \"Downloading and
      extracting zip...\"\n  wget -c -O tmp_genesis.zip \"$GENESIS_URL\"\n  unzip
      tmp_genesis.zip\n  rm tmp_genesis.zip\n  mv genesis.json \"$GENESIS_FILE\"\n}\n\nrm
      -f \"$GENESIS_FILE\"\n\ncase \"$GENESIS_URL\" in\n  *.json.gz) download_jsongz
      ;;\n  *.json) download_json ;;\n  *.tar.gz) download_targz ;;\n  *.tar.gzip)
      download_targz ;;\n  *.tar) download_tar ;;\n  *.zip) download_zip ;;\n  *)
      echo \"Unable to handle file extension for $GENESIS_URL\"; exit 1 ;;\nesac\n\necho
      \"Saved genesis file to $GENESIS_FILE.\"\necho \"Download genesis file complete.\"\n\necho
      \"Genesis $GENESIS_FILE initialized.\"\n"
    - -s
    - https://download.dimi.sh/juno-phoenix2-genesis.tar.gz
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: genesis-init
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shnpw
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - |2

      set -eu
      CONFIG_DIR="$CHAIN_HOME/config"
      TMP_DIR="$HOME/.tmp/config"
      OVERLAY_DIR="$HOME/.config"
      echo "Merging config..."
      set -x
      config-merge -f toml "$TMP_DIR/config.toml" "$OVERLAY_DIR/config-overlay.toml" > "$CONFIG_DIR/config.toml"
      config-merge -f toml "$TMP_DIR/app.toml" "$OVERLAY_DIR/app-overlay.toml" > "$CONFIG_DIR/app.toml"
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: config-merge
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shnpw
      readOnly: true
    workingDir: /home/operator
  - args:
    - -c
    - "set -eu\nif test -n \"$(find $DATA_DIR -maxdepth 1 -name '*.db' -print -quit)\";
      then\n\techo \"Databases in $DATA_DIR already exists; skipping initialization.\"\n\texit
      0\nfi\n\nset -eu\n\n# $CHAIN_HOME already set via pod env vars.\n\nSNAPSHOT_URL=\"$1\"\n\necho
      \"Downloading snapshot archive $SNAPSHOT_URL to $CHAIN_HOME...\"\n\ndownload_tar()
      {\n  echo \"Downloading and extracting tar...\"\n  wget -c -O - \"$SNAPSHOT_URL\"
      | tar -x -C \"$CHAIN_HOME\"\n}\n\ndownload_targz() {\n  echo \"Downloading and
      extracting compressed tar...\"\n  wget -c -O - \"$SNAPSHOT_URL\" | tar -xz -C
      \"$CHAIN_HOME\"\n}\n\ndownload_lz4() {\n  echo \"Downloading and extracting
      lz4...\"\n  wget -c -O - \"$SNAPSHOT_URL\" | lz4 -c -d | tar -x -C \"$CHAIN_HOME\"\n}\n\ncase
      \"$SNAPSHOT_URL\" in\n  *.tar.lz4) download_lz4 ;;\n  *.tar.gzip) download_targz
      ;;\n  *.tar.gz) download_targz ;;\n  *.tar) download_tar ;;\n  *) echo \"Unable
      to handle file extension for $SNAPSHOT_URL\"; exit 1 ;;\nesac\n\necho \"Download
      and extract snapshot complete.\"\n\necho \"$DATA_DIR initialized.\"\n"
    - -s
    - https://snapshots.polkachu.com/snapshots/juno/juno_5373433.tar.lz4
    command:
    - sh
    env:
    - name: HOME
      value: /home/operator
    - name: CHAIN_HOME
      value: /home/operator/cosmos
    - name: GENESIS_FILE
      value: /home/operator/cosmos/config/genesis.json
    - name: CONFIG_DIR
      value: /home/operator/cosmos/config
    - name: DATA_DIR
      value: /home/operator/cosmos/data
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imagePullPolicy: IfNotPresent
    name: snapshot-restore
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/operator/cosmos
      name: vol-chain-home
    - mountPath: /home/operator/.tmp
      name: vol-tmp
    - mountPath: /home/operator/.config
      name: vol-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shnpw
      readOnly: true
    workingDir: /home/operator
  nodeName: gke-juno-mainnet-full-chain-node-pool-b2985e35-w530
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  readinessGates:
  - conditionType: cloud.google.com/load-balancer-neg-ready
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1025
    fsGroupChangePolicy: OnRootMismatch
    runAsGroup: 1025
    runAsNonRoot: true
    runAsUser: 1025
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: vol-chain-home
    persistentVolumeClaim:
      claimName: pvc-juno-mainnet-fullnode-5
  - emptyDir: {}
    name: vol-tmp
  - configMap:
      defaultMode: 420
      items:
      - key: config-overlay.toml
        path: config-overlay.toml
      - key: app-overlay.toml
        path: app-overlay.toml
      name: juno-mainnet-fullnode-5
    name: vol-config
  - name: kube-api-access-shnpw
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'Pod has become Healthy in NEG "Key{\"k8s1-d906fa4e-strangelove-juno-mainnet-fullnode-rp-909-3bd6d742\",
      zone: \"us-east1-d\"}" attached to BackendService "Key{\"k8s1-d906fa4e-strangelove-juno-mainnet-fullnode-rp-909-3bd6d742\"}".
      Marking condition "cloud.google.com/load-balancer-neg-ready" to True.'
    reason: LoadBalancerNegReady
    status: "True"
    type: cloud.google.com/load-balancer-neg-ready
  - lastProbeTime: null
    lastTransitionTime: "2023-01-31T21:03:25Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-02-02T17:48:32Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-02-02T17:48:32Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-01-31T21:03:10Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: ghcr.io/strangelove-ventures/ignite-health-check:v0.0.1
    imageID: ""
    lastState:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was deleted.  The
          container used to be Running
        reason: ContainerStatusUnknown
        startedAt: null
    name: healthcheck
    ready: false
    restartCount: 1
    started: false
    state:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        startedAt: null
  - containerID: containerd://25229be205aae5f8008bb9bf0b0731176663c495bfe04460492a72a61db33724
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imageID: ghcr.io/strangelove-ventures/heighliner/juno@sha256:f65390b4383bdde4ae37e9b712e42b595dea98cdf9b1622450cd564c1b544ebf
    lastState: {}
    name: node
    ready: false
    restartCount: 1
    started: false
    state:
      terminated:
        containerID: containerd://25229be205aae5f8008bb9bf0b0731176663c495bfe04460492a72a61db33724
        exitCode: 137
        finishedAt: "2023-02-02T17:48:31Z"
        reason: OOMKilled
        startedAt: "2023-02-01T12:36:15Z"
  hostIP: 192.168.5.3
  initContainerStatuses:
  - containerID: containerd://cd099063efc6fc6c07c8242b467d6fbd55aca853b4c470274be5353527aee93a
    image: ghcr.io/strangelove-ventures/heighliner/juno:v11.0.0
    imageID: ghcr.io/strangelove-ventures/heighliner/juno@sha256:f65390b4383bdde4ae37e9b712e42b595dea98cdf9b1622450cd564c1b544ebf
    lastState: {}
    name: chain-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://cd099063efc6fc6c07c8242b467d6fbd55aca853b4c470274be5353527aee93a
        exitCode: 0
        finishedAt: "2023-01-31T21:03:20Z"
        reason: Completed
        startedAt: "2023-01-31T21:03:20Z"
  - containerID: containerd://fba829126cbb0edeade19e43740a3fbf601e27dfc3a9d0781a8abdc04c4cb723
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: genesis-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://fba829126cbb0edeade19e43740a3fbf601e27dfc3a9d0781a8abdc04c4cb723
        exitCode: 0
        finishedAt: "2023-01-31T21:03:21Z"
        reason: Completed
        startedAt: "2023-01-31T21:03:21Z"
  - containerID: containerd://7c8f483695363bc7ca7b4710b8a162a0a5ee233a320944bf52ae0f7f98da647e
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: config-merge
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://7c8f483695363bc7ca7b4710b8a162a0a5ee233a320944bf52ae0f7f98da647e
        exitCode: 0
        finishedAt: "2023-01-31T21:03:23Z"
        reason: Completed
        startedAt: "2023-01-31T21:03:22Z"
  - containerID: containerd://66c6157d18d5f6f427afeaead27ea52a012b51ba778d9fc12cd9099b2c922d34
    image: ghcr.io/strangelove-ventures/infra-toolkit:v0.0.1
    imageID: ghcr.io/strangelove-ventures/infra-toolkit@sha256:3aecfa18d9f0d730fd8821a7f556ba89dbb8cb3683a5e6ac11ec272884db8776
    lastState: {}
    name: snapshot-restore
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://66c6157d18d5f6f427afeaead27ea52a012b51ba778d9fc12cd9099b2c922d34
        exitCode: 0
        finishedAt: "2023-01-31T21:03:24Z"
        reason: Completed
        startedAt: "2023-01-31T21:03:24Z"
  message: 'The node was low on resource: memory. Container node was using 30684Mi,
    which exceeds its request of 12Gi. '
  phase: Failed
  podIP: 10.7.0.107
  podIPs:
  - ip: 10.7.0.107
  qosClass: Burstable
  reason: Evicted
  startTime: "2023-01-31T21:03:10Z"
DavidNix commented 1 year ago

Juno, specifically, may be under-resourced.

DavidNix commented 1 year ago

I feel that adding a feature to the operator would treat the symptom and not the cause. The `ContainerStatusUnknown` state indicates an issue with the cluster (not the cosmos node).

The only helpful advice I found was from https://github.com/kubernetes/kubernetes/issues/43279. This thread indicates the k8s node could become unresponsive to the kubelet if the node exhausts its memory.

Through working with the GCP support team, we figured out that the issue was triggered by not having memory limits on the pods, which was causing the oomkiller to run on the servers, sometimes killing processes it shouldn't. Even worse, the scheduler rescheduled these troublesome pods on other nodes, effectually poisoning the entire cluster. This is definitely something that should be prevented, but can at least be mitigated by setting default memory limits and making sure the limits on your pods are not too high.
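For reference, namespace-wide default memory limits can be set with a core Kubernetes `LimitRange`. This is only a minimal sketch: the object name and both memory values are placeholders I made up for illustration, and the namespace is copied from the pod dump above.

```yaml
# Sketch only: the name and all values are hypothetical placeholders.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits   # hypothetical name
  namespace: strangelove        # namespace from the pod dump above
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 1Gi   # applied to containers that set no memory request
    default:
      memory: 2Gi   # applied to containers that set no memory limit
```

Containers that declare their own memory limit keep it; the defaults only fill in missing values. A container that requests more than the default limit would likely need an explicit limit of its own.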

So we'll test a change with our Juno deployment config and observe.
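As a rough sketch of what such a change might look like at the container level, assuming the fix is to add a memory limit next to the existing request: the `12Gi` request comes from the eviction message above, the limit value is a placeholder, and where this block sits in the CosmosFullNode manifest is not shown in this thread.

```yaml
# Fragment of a container spec, not a complete manifest.
containers:
- name: node
  resources:
    requests:
      memory: 12Gi   # request reported in the eviction message above
    limits:
      memory: 16Gi   # placeholder; to be tuned by observation
```

With a limit in place, a container that grows past it should be OOM-killed on its own and restarted, instead of pushing the whole node into memory pressure and getting the pod evicted.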

Juno seems to run fine at around 12GB of memory. Perhaps it spikes during a restart. For us, this problem only occurs with Juno.