Closed: tcolgate closed this issue 2 years ago.
We are aware that running production workloads on pre-emptible nodes is "not recommended"; however, this behaviour was not present in 1.21 and appears fundamentally wrong.
The pods killed during node termination do not seem to get a DeletionTimestamp, so I think the endpoints controller considers them up. Presumably they should either be considered deleted, or at least have the IP unassigned?
The status update in the nodeshutdown manager does not seem sufficient on its own to have the pod removed from endpoints. The pod is marked as failed, with the reason also updated, but it is not deleted or otherwise updated. Is something else expected to release the IP and actually delete/evict the pod? The state of my pod suggests that, after quite a bit of time, this is the only update made to the pods on a terminated node.
The intent of this code hasn't changed from 1.21 (where we have not seen this issue). It could be a change in the endpoints controller, or in the kubelet podworkers syncPod logic (or possibly a timeout on the server side).
It is odd that the endpoints slice shows the pod as ready and not terminating. Interestingly, the IP was used by two pods around the same time.
docs-default-7c4ff6b58f-kqz9p 1/1 Terminated 0 3d2h 10.4.1.7 gke-europe-west1-pre-n1-4cpu-26gb-0298d3a1-fcsj <none> <none>
experiment-stats-worker-ppos-large-default-7b59cbc4cf-5lb49 0/1 Completed 0 3d2h 10.4.1.7 gke-europe-west1-pre-n1-4cpu-26gb-0298d3a1-fcsj <none> <none>
Both claim to have been terminated at node shutdown, which seems rather implausible. This is the second pod, though it did stop before the first started. The phase and message seem to have been overwritten, despite the fact that the pod wasn't actually running at the time (it was already Completed).
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-04-26T10:46:17Z"
  generateName: experiment-stats-worker-ppos-large-default-7b59cbc4cf-
  name: experiment-stats-worker-ppos-large-default-7b59cbc4cf-5lb49
  namespace: default
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-04-26T10:49:44Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-04-27T10:49:55Z"
    message: 'containers with unready status: [experiment-stats-worker-ppos-large]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-04-27T10:49:55Z"
    message: 'containers with unready status: [experiment-stats-worker-ppos-large]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-04-26T10:46:25Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6cd366e2c7c5dad656a8ac7b158f2662c6da682788e14b479df2b6086aa7028b
    image: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1:108ee3fccdcf10e244909e15a00022b59a7e4506
    imageID: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1@sha256:18e73da2ef4be2e926fbb5d7863f45e6ac0240b7fcdb2d220a7fdba22a0ca8e1
    lastState: {}
    name: experiment-stats-worker-ppos-large
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://6cd366e2c7c5dad656a8ac7b158f2662c6da682788e14b479df2b6086aa7028b
        exitCode: 0
        finishedAt: "2022-04-27T10:49:52Z"
        reason: Completed
        startedAt: "2022-04-26T10:49:44Z"
  hostIP: 10.132.0.57
  initContainerStatuses:
  - containerID: containerd://2293c9886aaaca7067cd1cf4f4f3ff1b568b8b7a28e0ba8fb77954ad6a9450e0
    image: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1:108ee3fccdcf10e244909e15a00022b59a7e4506
    imageID: eu.gcr.io/qubit-registry/deploy-containers/experiment-stats-worker_qubit-apps-eu-stg_europe-west1@sha256:18e73da2ef4be2e926fbb5d7863f45e6ac0240b7fcdb2d220a7fdba22a0ca8e1
    lastState: {}
    name: experiment-stats-worker-ppos-large-beanbag-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://2293c9886aaaca7067cd1cf4f4f3ff1b568b8b7a28e0ba8fb77954ad6a9450e0
        exitCode: 0
        finishedAt: "2022-04-26T10:49:32Z"
        reason: Completed
        startedAt: "2022-04-26T10:49:31Z"
  message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  podIP: 10.4.1.7
  podIPs:
  - ip: 10.4.1.7
  qosClass: Burstable
  reason: Terminated
  startTime: "2022-04-26T10:46:25Z"
```
> ...or in the kubelet podworkers syncPod logic (or possibly a timeout on the server side).

That logic had a major refactor and several changes around the pod phase on termination:
https://github.com/kubernetes/kubernetes/pull/103668 https://github.com/kubernetes/kubernetes/pull/104817 https://github.com/kubernetes/kubernetes/pull/104918
I'm trying to walk through that code at the moment. The definition of a runtime-only pod doesn't seem super clear (pods aren't marked as such when they are submitted to the node shutdown pod kill, so either it doesn't apply, or they get marked as such by the cache layer in some process I've not seen yet).
The final pod status is "Terminated" (rather than "Terminating"), so I don't /think/ the kubelet is killed before the process completes. Either way, a Terminated pod shouldn't be active in an endpoints list (should it?).
I've tried to manually recreate the issue by forcing a systemctl reboot on a node. In this case I see the same reason and message, but all the containers show as stopped, and the pod is removed from the endpoints.
The main difference between this test and a true preemption is time: the preempted machine has a specific time limit to complete the shutdown in. It may be that the node is being killed before the container statuses are updated (though the pod is updated as Terminated); it could also be that there is an error during the container shutdown, but I need to get to the kubelet logs to see that.
Yes, so I think the bit that feels wrong is that in the original problem pod we have
```yaml
status:
  containerStatuses:
  - containerID: containerd://2bf2d59e54e86e4fa0921813eb58e26defdbb18c731e00519645376dd7f6a024
    image: XXXX
    imageID: XXXX
    lastState: {}
    name: docs
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-04-27T10:53:47Z"
```
Which is enough for the endpoints and endpointslice controllers to think that it should be part of the active endpoints list, even though the actual pod is in phase: Failed.
this is what the endpoints controllers use to determine if the pod should be listed
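To make that concrete, here is a rough sketch (illustrative only, not the upstream controller code; the real logic also deals with hostNetwork pods, terminating endpoints, publishNotReadyAddresses, and so on) of the kind of check being described. Note that nothing in it consults the pod phase, so a Failed pod that still has an IP and a stale Ready=True condition passes:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// isPodReady reports whether the pod's Ready condition is True.
func isPodReady(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}

// shouldBeInEndpoints is a simplified stand-in for the controller logic under
// discussion: it keys off the pod IP, the deletion timestamp, and the Ready
// condition, but never looks at pod.Status.Phase.
func shouldBeInEndpoints(pod *v1.Pod) bool {
	if pod.Status.PodIP == "" {
		return false // no address to publish
	}
	if pod.DeletionTimestamp != nil {
		return false // pod is being deleted
	}
	return isPodReady(pod)
}
```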
The other possible culprit here is that these are containerd, not docker, nodes. Unfortunately other bugs mean we can't revert the base OS to the docker version. It's possible that it is the KillPod logic in pkg/kubelet/kuberuntime/kuberuntime_manager.go. The kubelet was clearly alive long enough to mark the pod as Terminated, so I'm assuming it was around long enough to at least attempt to stop the containers.
Cannot currently access the kubelet logs, waiting on vendor feedback on that.
Will tag both SIGs involved in the meantime; we'll wait for the logs.
/sig node /sig network
> this is what the endpoints controllers use to determine if the pod should be listed
Yes, endpoints uses something similar (though not quite as clearly stated): the pod isn't terminating (it is already terminated, and Failed), and the pod has no deletion timestamp (it isn't deleted; I believe that is intentional, so the pod remains visible to related jobs, for instance).
I can't decide whether it is ever valid for pods with pod.Phase = "Failed" to appear in endpoints.
I have managed to retrieve the kubelet logs around the node terminations. It looks like the pod deletion does not complete before the node is shut down. The pod is marked as failed, but this appears to happen before the containers are marked as failed, which leaves the pod in a state that the endpoints controller (as shown above) still considers capable of serving traffic.
For a problem pod (that still gets listed in endpoints), the following is present in the kubelet log
I0503 10:47:29.051548 1834 nodeshutdown_manager_linux.go:270] "Shutdown manager killing pod with gracePeriod" pod="default/problem-default-57997dd668-pmtj5" gracePeriod=15
I0503 10:47:29.054407 1834 kuberuntime_container.go:723] "Killing container with a grace period" pod="default/problem-default-57997dd668-pmtj5" podUID=11c7d763-31db-454c-a147-310f076ab9fd containerName="problem" containerID="containerd://64f819e9094e2cd73f3b787748fd89cdc13e8f34f3a2e62beb6aa0aed61b3bad" gracePeriod=15
For a pod that is properly cleaned up
I0503 10:47:29.050498 1834 nodeshutdown_manager_linux.go:270] "Shutdown manager killing pod with gracePeriod" pod="default/good-default-7dd4d48896-6khj9" gracePeriod=15
I0503 10:47:29.051501 1834 kuberuntime_container.go:723] "Killing container with a grace period" pod="default/good-default-7dd4d48896-6khj9" podUID=c29e213b-f899-4843-953f-75d02b017d5f containerName="good" containerID="containerd://1bca503d475d3ff0f47d8b71d2218ab2cd48ba776dcd9ba1b67b5cb68dd6bd46" gracePeriod=15
I0503 10:47:40.908013 1834 nodeshutdown_manager_linux.go:281] "Shutdown manager finished killing pod" pod="default/good-default-7dd4d48896-6khj9"
I0503 10:47:41.303568 1834 kubelet.go:2112] "SyncLoop (PLEG): event for pod" pod="default/good-default-7dd4d48896-6khj9" event=&{ID:c29e213b-f899-4843-953f-75d02b017d5f Type:ContainerDied Data:1bca503d475d3ff0f47d8b71d2218ab2cd48ba776dcd9ba1b67b5cb68dd6bd46}
I0503 10:47:41.303628 1834 kubelet.go:2112] "SyncLoop (PLEG): event for pod" pod="default/good-default-7dd4d48896-6khj9" event=&{ID:c29e213b-f899-4843-953f-75d02b017d5f Type:ContainerDied Data:03a9354b962c1b1c35e6a57bb08f37bbc7fe5f72c62a45965312407bdde3ee56}
In the latter case, the pod is marked as terminated due to node shutdown, but the containers are also marked as terminated, and the pod is not listed in the endpoints.
The simplest solution would seem to be for endpoints to skip failed pods, but I'm not sure how valid that is. A second option would be for the containers to be marked as deleted (or "pre-deleted") before the actual deletion starts (which is the time-consuming bit). Semantically both could be tricky, vis-à-vis the behaviour of things like pre-stop hooks.
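For illustration, the first option amounts to putting a terminal-phase guard in front of the readiness check from the sketch above (this is only a sketch of the idea; I haven't verified that the eventual upstream fix is implemented exactly this way). It reuses the v1 import and shouldBeInEndpoints from that sketch:

```go
// isPodTerminal reports whether the pod has reached a terminal phase.
// A Failed or Succeeded pod can never serve traffic again.
func isPodTerminal(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded
}

// shouldBeInEndpointsWithGuard applies the terminal-phase guard before the
// readiness check from the earlier sketch.
func shouldBeInEndpointsWithGuard(pod *v1.Pod) bool {
	if isPodTerminal(pod) {
		return false
	}
	return shouldBeInEndpoints(pod)
}
```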
Looking at 1.21 vs 1.22.
In syncPod in 1.21, the pod status is set to whatever is returned by PodStatusFunc, which in the case of node termination is an almost empty pod status with the node-termination reason and message. This matches what I see on terminated pods there.
In SyncPod in 1.22, the existing pod status is merged (in generateAPIPodStatus, using convertStatusToAPIStatus(pod)) with the requested pod status.
The newer version of the code is quite a bit harder to read (the pod termination code in 1.21 is an explicit if block that clearly handles termination, whereas in 1.22 termination appears to be handled more implicitly). I have also not verified (yet) whether the status manager in 1.21 does merging.
It may be that 1.21's clumsier status handling wipes out all the container statuses, which has, as a side effect, removal from endpoints, whereas the more fine-grained handling in 1.22 keeps the container status around until the actual container is shut down (which may not happen if the kubelet is killed first).
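As a hypothetical, heavily simplified illustration of the difference being described (these are not the real kubelet functions; the real paths are syncPod/SyncPod, generateAPIPodStatus and the status manager):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// overwriteStatus models the 1.21-style behaviour described above: the whole
// status is replaced by the nearly empty "terminated by node shutdown" status,
// dropping the container statuses (and hence endpoint readiness) as a side effect.
func overwriteStatus(current, shutdown v1.PodStatus) v1.PodStatus {
	return shutdown
}

// mergeStatus models the 1.22-style behaviour: phase, reason and message are
// updated, but the existing container statuses are carried over until the
// runtime actually reports the containers dead, which may never happen if the
// node is powered off first.
func mergeStatus(current, shutdown v1.PodStatus) v1.PodStatus {
	merged := current
	merged.Phase = shutdown.Phase
	merged.Reason = shutdown.Reason
	merged.Message = shutdown.Message
	return merged
}
```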
In addition, this comment is new (I think). What happens if the kubelet is shut down before these changes are synced? https://github.com/kubernetes/kubernetes/blob/release-1.22/pkg/kubelet/pod_workers.go#L902-L909
There is an event suggesting that the deletion of a pod was cancelled by the taint controller, but that exists for both the problem and good pods above.
A couple of other observations:
Running a controller to automatically harvest all pods that are PodFailed with the relevant Message should work as a workaround, but that prevents correctly shut-down pods from restarting after a reboot (which has some speed benefits), and loses the benefit of keeping pods around for information purposes. It might be possible for the controller to UpdateStatus instead (but that could be tricky, and possibly interfere with other controllers). A rough sketch of that workaround is below.
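The sketch assumes client-go; the polling interval, the field selector, and matching on the node-shutdown message are illustrative choices, not a vetted design:

```go
package main

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// shutdownMessage is the status message set by the kubelet nodeshutdown manager.
const shutdownMessage = "Pod was terminated in response to imminent node shutdown."

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		// Only Failed pods are candidates; status.phase is a supported field selector.
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
			metav1.ListOptions{FieldSelector: "status.phase=Failed"})
		if err == nil {
			for _, p := range pods.Items {
				if strings.Contains(p.Status.Message, shutdownMessage) {
					// Deleting the pod removes it from endpoints, at the cost of
					// losing the shutdown record (see the caveats above).
					_ = client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(time.Minute)
	}
}
```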
> In addition, this comment is new (I think). What happens if the kubelet is shut down before these changes are synced? https://github.com/kubernetes/kubernetes/blob/release-1.22/pkg/kubelet/pod_workers.go#L902-L909
@smarterclayton can you enlighten us here?
@tcolgate do you think these 2 issues are the same? They look the same to me, or at least related
> @tcolgate do you think these 2 issues are the same? They look the same to me, or at least related
If kube-proxy (or whatever is in use here) is relying on container status/readiness (as opposed to pod status), or on endpoints membership, to add forwarding rules, then it is certainly possible. On closer inspection, the IPs in the mentioned ticket are listed in notReadyAddresses in the endpoints, whereas in my case they are listed as ready. It may be related, but tangentially. The description of https://github.com/kubernetes/kubernetes/pull/102344 definitely sounds in the right ballpark though; it could well be the same commit.
FWIW, https://github.com/kubernetes/kubernetes/issues/108594 sounds more closely related to the description in this ticket. It was cherry-picked into 1.22.9, but the initial description here mentions 1.22.8.
Yes, that looks exactly the same. I'll push for 1.22.9 in GKE or a cherry-pick of the related fix.
Hi all, just to confirm: we are facing this on multiple production clusters in GKE (#109414), definitely the same issue, and since upgrading to v1.22.7.
/cc @bowei
Hi, just to confirm we are also facing this on our cluster in GKE since upgrading to v1.22
I can also confirm the same behaviour on "1.22.9-gke.1300" and "1.23.6-gke.1500"
/assign @bobbypage
/assign
this is going to be fixed by https://github.com/kubernetes/kubernetes/pull/110115
/triage accepted
This has been fixed by https://github.com/kubernetes/kubernetes/pull/110255 /close
feel free to reopen if I was wrong
@aojea: Closing this issue.
It may be caused by the NodeLifecycle controller, and might be easier to fix in the NLC controller.
related issue https://github.com/kubernetes/kubernetes/issues/109998
Google support suggested that GKE version 1.22.9-2000 should fix this (unfortunately this is largely an opaque version number; the release notes just link to the 1.22.9 release notes and do not indicate any additional patches). It did not, and due to some issues with our support contract I am now basically unable to update the ticket.
@tcolgate maybe should reopen this?
It looks like the fix has been released only for v1.25 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.25.md#bug-or-regression-3).
I don't see any mention of the fix in the changelogs for other releases; can someone confirm?
This is not included in the changelog due to a missing release note in the cherry-pick PRs. The PR above was cherry-picked. Not sure if we need to update the changelog.
What happened?
On our GKE cluster (using preemptible nodes), pods are correctly marked as "terminated", but are still showing as "ready" in the endpoints list, causing traffic to route to them.
Deleting the Terminated pods removes them from the endpoints list.
The Terminated pods reappear after a node preemption.
What did you expect to happen?
Terminated pods should not be listed in endpoints.
How can we reproduce it (as minimally and precisely as possible)?
Create a GKE cluster using k8s 1.22.8-gke.200.
Add a node group using preemptible nodes, on Container-Optimized OS and containerd.
Create a deployment and service.
Check the endpoints list after a preemption event.
Anything else we need to know?
Here is the state of the pod, endpoints, endpointslice, and the service definition.
Example pod (reason shows terminated due to node shutdown), redacted image name details and some args and command details:
endpoints:
endpointsslice
and the service
Kubernetes version
Cloud provider
GKE
OS version
Nodes are cos_containerd 1.22.8-gke.200
Install tools
GKE
Container runtime (CRI) and version (if applicable)
containerd
Related plugins (CNI, CSI, ...) and versions (if applicable)
GKE defaults