Closed: tcolgate closed this issue 2 years ago.
We are aware that running production workloads on pre-emptible nodes is "not recommended"; however, this behaviour was not present in 1.21 and appears fundamentally wrong.
The pods killed during node termination do not seem to get a DeletionTimestamp, so I think the endpoints controller considers them up. Presumably they should either be considered deleted, or at least have the IP unassigned?
The status update in the nodeshutdown manager does not seem sufficient on its own to have the pod removed from endpoints. The pod is marked as failed, with the reason also updated, but it is not deleted or otherwise updated. Is something else expected to release the IP and actually delete/evict the pod? The state of my pod suggests that, after quite a bit of time, this is the only update made to the pods on a terminated node.
The intent of this code hasn't changed from 1.21 (where we have not seen this issue). It could be a change in the endpoints controller, or in the kubelet podworkers syncPod logic (or possibly a timeout on the server side).
It is odd that the endpoints slice shows the pod as ready and not terminating. Interestingly, the IP was used by two pods around the same time.
docs-default-7c4ff6b58f-kqz9p 1/1 Terminated 0 3d2h 10.4.1.7 gke-europe-west1-pre-n1-4cpu-26gb-0298d3a1-fcsj <none> <none>
experiment-stats-worker-ppos-large-default-7b59cbc4cf-5lb49 0/1 Completed 0 3d2h 10.4.1.7 gke-europe-west1-pre-n1-4cpu-26gb-0298d3a1-fcsj <none> <none>
Both claim to have been terminated at node shutdown, which seems rather implausible. This is the second pod, though it did stop before the first started. The phase and message seem to have been overwritten, despite the fact that the pod wasn't actually running at the time (it was already Completed).
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-04-26T10:46:17Z"
  generateName: experiment-stats-worker-ppos-large-default-7b59cbc4cf-
  name: experiment-stats-worker-ppos-large-default-7b59cbc4cf-5lb49
  namespace: default
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-04-26T10:49:44Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-04-27T10:49:55Z"
    message: 'containers with unready status: [experiment-stats-worker-ppos-large]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-04-27T10:49:55Z"
    message: 'containers with unready status: [experiment-stats-worker-ppos-large]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-04-26T10:46:25Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6cd366e2c7c5dad656a8ac7b158f2662c6da682788e14b479df2b6086aa7028b
    image: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1:108ee3fccdcf10e244909e15a00022b59a7e4506
    imageID: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1@sha256:18e73da2ef4be2e926fbb5d7863f45e6ac0240b7fcdb2d220a7fdba22a0ca8e1
    lastState: {}
    name: experiment-stats-worker-ppos-large
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://6cd366e2c7c5dad656a8ac7b158f2662c6da682788e14b479df2b6086aa7028b
        exitCode: 0
        finishedAt: "2022-04-27T10:49:52Z"
        reason: Completed
        startedAt: "2022-04-26T10:49:44Z"
  hostIP: 10.132.0.57
  initContainerStatuses:
  - containerID: containerd://2293c9886aaaca7067cd1cf4f4f3ff1b568b8b7a28e0ba8fb77954ad6a9450e0
    image: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1:108ee3fccdcf10e244909e15a00022b59a7e4506
    imageID: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1@sha256:18e73da2ef4be2e926fbb5d7863f45e6ac0240b7fcdb2d220a7fdba22a0ca8e1
    lastState: {}
    name: experiment-stats-worker-ppos-large-beanbag-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://2293c9886aaaca7067cd1cf4f4f3ff1b568b8b7a28e0ba8fb77954ad6a9450e0
        exitCode: 0
        finishedAt: "2022-04-26T10:49:32Z"
        reason: Completed
        startedAt: "2022-04-26T10:49:31Z"
  message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  podIP: 10.4.1.7
  podIPs:
  - ip: 10.4.1.7
  qosClass: Burstable
  reason: Terminated
  startTime: "2022-04-26T10:46:25Z"
```
> ...or in the kubelet podworkers syncPod logic (or possibly a timeout on the server side).

That logic had a major refactor and several changes around the pod phase on termination:
https://github.com/kubernetes/kubernetes/pull/103668 https://github.com/kubernetes/kubernetes/pull/104817 https://github.com/kubernetes/kubernetes/pull/104918
I'm trying to walk through that code at the moment. The definition of a runtime-only pod doesn't seem super clear (pods aren't marked as such when they are submitted to the node shutdown pod kill, so either it doesn't apply, or they get marked as such by the cache layer in some process I've not seen yet).
The final pod status is "Terminated" (rather than "Terminating"), so I don't /think/ the kubelet is killed before the process completes. Either way, a Terminated pod shouldn't be active in an endpoints list (should it?).
I've tried to manually recreate the issue by forcing a systemctl reboot on a node. In this case I see the same reason and message, but all the containers show as stopped, and the pod is removed from the endpoints.
The main difference between this test and a true preemption is time: the preempted machine has a specific time limit to complete the shutdown in. It may be that the node is being killed before the container statuses are updated (though the pod is updated as Terminated); it could also be that there is an error during the container shutdown, but I need to get to the kubelet logs to see that.
Yes, so I think the bit that feels wrong is that in the original problem pod we have
```yaml
status:
  containerStatuses:
  - containerID: containerd://2bf2d59e54e86e4fa0921813eb58e26defdbb18c731e00519645376dd7f6a024
    image: XXXX
    imageID: XXXX
    lastState: {}
    name: docs
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-04-27T10:53:47Z"
```
Which is enough for the endpoints and endpointslice controllers to think that it should be part of the active endpoints list, even though the actual pod is in phase: Failed.
this is what the endpoints controllers use to determine if the pod should be listed
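To make that concrete, here is a rough sketch (illustrative only, not the upstream controller code; the real logic also deals with hostNetwork pods, terminating endpoints, publishNotReadyAddresses, and so on) of the kind of check being described. Note that nothing in it consults the pod phase, so a Failed pod that still has an IP and a stale Ready=True condition passes:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// isPodReady reports whether the pod's Ready condition is True.
func isPodReady(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}

// shouldBeInEndpoints is a simplified stand-in for the controller logic under
// discussion: it keys off the pod IP, the deletion timestamp, and the Ready
// condition, but never looks at pod.Status.Phase.
func shouldBeInEndpoints(pod *v1.Pod) bool {
	if pod.Status.PodIP == "" {
		return false // no address to publish
	}
	if pod.DeletionTimestamp != nil {
		return false // pod is being deleted
	}
	return isPodReady(pod)
}
```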
The other possible culprit here is that these are containerd, not docker, nodes. Unfortunately other bugs mean we can't revert the base OS to the docker version. It's possible that it is the KillPod logic in pkg/kubelet/kuberuntime/kuberuntime_manager.go. The kubelet was clearly alive long enough to mark the pod as Terminated, so I'm assuming it was around long enough to at least attempt to stop the containers.
Cannot currently access the kubelet logs, waiting on vendor feedback on that.
Will tag both SIGs involved in the meantime; we'll wait for the logs.
/sig node /sig network
> this is what the endpoints controllers use to determine if the pod should be listed
Yes, endpoints uses something similar (though not quite as clearly stated): the pod isn't terminating (it is already terminated, and Failed), and the pod has no deletion timestamp (it isn't deleted; I believe that is intentional, so the pod remains visible to related jobs, for instance).
I can't decide whether it is ever valid for pods with pod.Phase = "Failed" to appear in endpoints.
I have managed to retrieve the kubelet logs around the node terminations. It looks like the pod deletion does not complete before the node is shut down. The pod is marked as failed, but this appears to happen before the containers are marked as failed, which leaves the pod in a state that the endpoints controller (as shown above) still considers capable of serving traffic.
For a problem pod (that still gets listed in endpoints), the following is present in the kubelet log
I0503 10:47:29.051548 1834 nodeshutdown_manager_linux.go:270] "Shutdown manager killing pod with gracePeriod" pod="default/problem-default-57997dd668-pmtj5" gracePeriod=15
I0503 10:47:29.054407 1834 kuberuntime_container.go:723] "Killing container with a grace period" pod="default/problem-default-57997dd668-pmtj5" podUID=11c7d763-31db-454c-a147-310f076ab9fd containerName="problem" containerID="containerd://64f819e9094e2cd73f3b787748fd89cdc13e8f34f3a2e62beb6aa0aed61b3bad" gracePeriod=15
For a pod that is properly cleaned up
I0503 10:47:29.050498 1834 nodeshutdown_manager_linux.go:270] "Shutdown manager killing pod with gracePeriod" pod="default/good-default-7dd4d48896-6khj9" gracePeriod=15
I0503 10:47:29.051501 1834 kuberuntime_container.go:723] "Killing container with a grace period" pod="default/good-default-7dd4d48896-6khj9" podUID=c29e213b-f899-4843-953f-75d02b017d5f containerName="good" containerID="containerd://1bca503d475d3ff0f47d8b71d2218ab2cd48ba776dcd9ba1b67b5cb68dd6bd46" gracePeriod=15
I0503 10:47:40.908013 1834 nodeshutdown_manager_linux.go:281] "Shutdown manager finished killing pod" pod="default/good-default-7dd4d48896-6khj9"
I0503 10:47:41.303568 1834 kubelet.go:2112] "SyncLoop (PLEG): event for pod" pod="default/good-default-7dd4d48896-6khj9" event=&{ID:c29e213b-f899-4843-953f-75d02b017d5f Type:ContainerDied Data:1bca503d475d3ff0f47d8b71d2218ab2cd48ba776dcd9ba1b67b5cb68dd6bd46}
I0503 10:47:41.303628 1834 kubelet.go:2112] "SyncLoop (PLEG): event for pod" pod="default/good-default-7dd4d48896-6khj9" event=&{ID:c29e213b-f899-4843-953f-75d02b017d5f Type:ContainerDied Data:03a9354b962c1b1c35e6a57bb08f37bbc7fe5f72c62a45965312407bdde3ee56}
In the latter case, the pod is marked as terminated due to node shutdown, but the containers are also marked as terminated, and the pod is not listed in the endpoints.
The simplest solution would seem to be for endpoints to skip failed pods, but I'm not sure how valid that is. A second option would be for the containers to be marked as deleted (or "pre-deleted") before the actual deletion starts (which is the time-consuming bit). Semantically both could be tricky, vis-à-vis the behaviour of things like pre-stop hooks.
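For illustration, the first option amounts to putting a terminal-phase guard in front of the readiness check from the sketch above (this is only a sketch of the idea; I haven't verified that the eventual upstream fix is implemented exactly this way). It reuses the v1 import and shouldBeInEndpoints from that sketch:

```go
// isPodTerminal reports whether the pod has reached a terminal phase.
// A Failed or Succeeded pod can never serve traffic again.
func isPodTerminal(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded
}

// shouldBeInEndpointsWithGuard applies the terminal-phase guard before the
// readiness check from the earlier sketch.
func shouldBeInEndpointsWithGuard(pod *v1.Pod) bool {
	if isPodTerminal(pod) {
		return false
	}
	return shouldBeInEndpoints(pod)
}
```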
Looking at 1.21 vs 1.22.
In syncPod in 1.21, the pod status is set to whatever is returned by PodStatusFunc, which in the case of node termination is an almost empty pod status with the node-termination reason and message. This matches what I see on terminated pods there.
In SyncPod in 1.22, the existing pod status is merged (in generateAPIPodStatus, using convertStatusToAPIStatus(pod)) with the requested pod status.
The newer version of the code is quite a bit harder to read (the pod termination code in 1.21 is an explicit if block that clearly handles termination, whereas in 1.22 termination appears to be handled more implicitly). I have also not verified (yet) whether the status manager in 1.21 does merging.
It may be that 1.21's clumsier status handling wipes out all the container statuses, which has, as a side effect, removal from endpoints, whereas the more fine-grained handling in 1.22 keeps the container status around until the actual container is shut down (which may not happen if the kubelet is killed first).
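As a hypothetical, heavily simplified illustration of the difference being described (these are not the real kubelet functions; the real paths are syncPod/SyncPod, generateAPIPodStatus and the status manager):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// overwriteStatus models the 1.21-style behaviour described above: the whole
// status is replaced by the nearly empty "terminated by node shutdown" status,
// dropping the container statuses (and hence endpoint readiness) as a side effect.
func overwriteStatus(current, shutdown v1.PodStatus) v1.PodStatus {
	return shutdown
}

// mergeStatus models the 1.22-style behaviour: phase, reason and message are
// updated, but the existing container statuses are carried over until the
// runtime actually reports the containers dead, which may never happen if the
// node is powered off first.
func mergeStatus(current, shutdown v1.PodStatus) v1.PodStatus {
	merged := current
	merged.Phase = shutdown.Phase
	merged.Reason = shutdown.Reason
	merged.Message = shutdown.Message
	return merged
}
```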
In addition, this comment is new (I think). What happens if the kubelet is shut down before these changes are synced? https://github.com/kubernetes/kubernetes/blob/release-1.22/pkg/kubelet/pod_workers.go#L902-L909
There is an event suggesting that the deletion of a pod was cancelled by the taint controller, but that exists for both the problem and good pods above.
A couple of other observations:
Running a controller to automatically harvest all pods that are PodFailed with the relevant Message should work as a workaround, but that prevents correctly shut-down pods from restarting after a reboot (which has some speed benefits), and loses the benefit of keeping pods around for information purposes. It might be possible for the controller to UpdateStatus instead (but that could be tricky, and possibly interfere with other controllers). A rough sketch of that workaround is below.
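The sketch assumes client-go; the polling interval, the field selector, and matching on the node-shutdown message are illustrative choices, not a vetted design:

```go
package main

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// shutdownMessage is the status message set by the kubelet nodeshutdown manager.
const shutdownMessage = "Pod was terminated in response to imminent node shutdown."

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		// Only Failed pods are candidates; status.phase is a supported field selector.
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
			metav1.ListOptions{FieldSelector: "status.phase=Failed"})
		if err == nil {
			for _, p := range pods.Items {
				if strings.Contains(p.Status.Message, shutdownMessage) {
					// Deleting the pod removes it from endpoints, at the cost of
					// losing the shutdown record (see the caveats above).
					_ = client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(time.Minute)
	}
}
```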
> In addition, this comment is new (I think). What happens if the kubelet is shut down before these changes are synced? https://github.com/kubernetes/kubernetes/blob/release-1.22/pkg/kubelet/pod_workers.go#L902-L909
@smarterclayton can you enlighten us here?
@tcolgate do you think these 2 issues are the same? They look the same to me, or at least related
> @tcolgate do you think these 2 issues are the same? They look the same to me, or at least related
If kube-proxy (or whatever is in use here) is relying on container status/readiness (as opposed to pod status), or on endpoints membership, to add forwarding rules, then it is certainly possible. On closer inspection, the IPs in the mentioned ticket are listed in notReadyAddresses in the endpoints, whereas in my case they are listed as ready. It may be related, but tangentially. The description of https://github.com/kubernetes/kubernetes/pull/102344 definitely sounds in the right ballpark though; it could well be the same commit.
FWIW, https://github.com/kubernetes/kubernetes/issues/108594 sounds more closely related to the description in this ticket. It was cherry-picked into 1.22.9, but the initial description here mentions 1.22.8.
Yes, that looks exactly the same. I'll push for 1.22.9 in GKE or a cherry-pick of the related fix.
Hi all, just to confirm: we are facing this on multiple production clusters in GKE (#109414), definitely the same issue, and since upgrading to v1.22.7.
/cc @bowei
Hi, just to confirm we are also facing this on our cluster in GKE since upgrading to v1.22
I can also confirm the same behaviour on "1.22.9-gke.1300" and "1.23.6-gke.1500"
/assign @bobbypage
/assign
this is going to be fixed by https://github.com/kubernetes/kubernetes/pull/110115
/triage accepted
This has been fixed by https://github.com/kubernetes/kubernetes/pull/110255 /close
feel free to reopen if I was wrong
@aojea: Closing this issue.
It may be caused by the NodeLifecycle controller, and might be easier to fix in the NLC controller.
related issue https://github.com/kubernetes/kubernetes/issues/109998
Google support suggested that GKE version 1.22.9-2000 should fix this (unfortunately this is largely an opaque version number; the release notes just link to the 1.22.9 release notes and do not indicate any additional patches). It did not, and due to some issues with our support contract I am now basically unable to update the ticket.
@tcolgate maybe should reopen this?
It looks like the fix has been released only for v1.25 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.25.md#bug-or-regression-3).
I don't see any mention of the fix in the changelogs for other releases; can someone confirm?
This is not included in the changelog due to a missing release note in the cherry-pick PRs. The PR above was cherry-picked. Not sure if we need to update the changelog.
What happened?
On our GKE cluster (using preemptible nodes), pods are correctly marked as "terminated", but are still showing as "ready" in the endpoints list, causing traffic to route to them.
Deleting the Terminated pods removes them from the endpoints list.
The Terminated pods reappear after a node preemption.
What did you expect to happen?
Terminated pods should not be listed in endpoints.
How can we reproduce it (as minimally and precisely as possible)?
Create a GKE cluster using k8s 1.22.8-gke.200.
Add a node group using preemptible nodes, on Container-Optimized OS and containerd.
Create a deployment and service.
Check the endpoints list after a preemption event.
Anything else we need to know?
Here is the state of the pod, endpoints, endpointslice, and the service definition.
Example pod (reason shows terminated due to node shutdown), redacted image name details and some args and command details:
endpoints:
endpointsslice
and the service
Kubernetes version
Cloud provider
GKE
OS version
Nodes are cos_containerd 1.22.8-gke.200
Install tools
GKE
Container runtime (CRI) and version (if applicable)
containerd
Related plugins (CNI, CSI, ...) and versions (if applicable)
GKE defaults