doctorpangloss opened this issue 2 years ago
This looks like it might be the same issue as https://github.com/projectcalico/calico/issues/5828, which was fixed by this PR: https://github.com/projectcalico/calico/pull/6656. Unfortunately, it hasn't made it into any release just yet; it is scheduled for Calico v3.25.
Closing as fixed by https://github.com/projectcalico/calico/pull/6656
This issue persists with v3.26.4, which should include this change.
I think the issue now is that we no longer use https://github.com/projectcalico/calico/blob/master/node/windows-packaging/CalicoWindows/kubernetes/kubelet-service.ps1 for running kubelet, but rather it is installed by this script maintained by the kubernetes sig-windows group: https://github.com/kubernetes-sigs/sig-windows-tools/blob/master/hostprocess/PrepareNode.ps1
If I'm understanding things correctly, this is a problem with Windows containerization, in that it doesn't prioritize Calico pods on reboots (at least the original issue would happen on node reboot; @doctorpangloss, could you please provide more details as to when this is being hit?). I'm not sure there's much we can do as a workaround, but you could also try to replace the kubelet service with the one from the Calico repo (and see if the inclusion of Wait-ForCalicoInit() solves it for you).
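For illustration only, the idea behind that gating is roughly the following PowerShell sketch (not the actual Wait-ForCalicoInit implementation). The service names CalicoNode and kubelet are assumptions and depend on how your node was provisioned:

# Minimal sketch: delay kubelet start until Calico has initialized.
# Assumptions (not taken from the Calico scripts): the Calico node agent runs as a
# Windows service named "CalicoNode" and kubelet is registered as a service named "kubelet".
while ((Get-Service -Name "CalicoNode" -ErrorAction SilentlyContinue).Status -ne "Running") {
    Write-Host "Waiting for the Calico node service to reach Running..."
    Start-Sleep -Seconds 5
}
Start-Service -Name "kubelet"

The packaged Calico scripts may use a different readiness signal; this only shows the "don't start kubelet before Calico is up" ordering.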
@doctorpangloss were you able to test this?
Because I use hostprocess containers to deploy Calico, what would be the most sensible way to use Wait-ForCalicoInit?
The issue still occurs with Calico 3.26.4 and Kubernetes 1.29.7:
{
  "namespace": "...",
  "podName": "dinkydiner-unity-deployment-85d8f5bcbc-wg44f",
  "reason": "FailedKillPod",
  "message": "error killing pod: failed to \"KillPodSandbox\" for \"206c8fee-4106-4f8e-b6a6-7bdcbba9bdd6\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to remove network namespace for sandbox \\\"2e9f06d178440fa214f9bf529a05fd91b3017e03d506ef34b08891d6541e708f\\\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\\\"Success\\\":false,\\\"Error\\\":\\\"The specified request is unsupported. \\\",\\\"ErrorCode\\\":2151350293}\"",
  "count": 6493,
  "lastTimestamp": "2024-09-10T17:33:14Z"
}
I still see this issue
$ kubectl describe pods -n ... dinkydiner-unity-deployment-85d8f5bcbc-rmrpg
Name:                      dinkydiner-unity-deployment-85d8f5bcbc-rmrpg
Namespace:                 ...
Priority:                  0
Service Account:           default
Node:                      .../...
Start Time:                Fri, 30 Aug 2024 18:08:14 -0700
Labels:                    app.kubernetes.io/instance=dinkydiner-unity-deployment
                           app.kubernetes.io/name=unity-deployment
                           pod-template-hash=85d8f5bcbc
Annotations:               appmana.artifactId: .../dinkydiner
                           appmana.project: dinkydiner
                           cni.projectcalico.org/containerID: fac9a792aa7f303f8e9225d493d28335bca9e8445e62b119684fef9b591cf090
                           cni.projectcalico.org/podIP:
                           cni.projectcalico.org/podIPs:
Status:                    Terminating (lasts 24h)
Termination Grace Period:  30s
IP:                        10.3.245.169
IPs:
  IP:  10.3.245.169
Controlled By:             ReplicaSet/dinkydiner-unity-deployment-85d8f5bcbc
Containers:
  dinkydiner-unity-deployment:
    Container ID:   containerd://bc3286c5a71bc3eb73357e51846eb06471fd0b0f23bd46bb6f2ea33ff081ad14
    Image:          ...
    State:          Terminated
      Reason:       Error
      Exit Code:    -1073741510
      Started:      Sun, 08 Sep 2024 15:29:46 -0700
      Finished:     Mon, 09 Sep 2024 10:23:21 -0700
    Ready:          False
    Restart Count:  2
    Limits:
      microsoft.com/directx:  1
    Requests:
      cpu:                    2
      memory:                 1000Mi
      microsoft.com/directx:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6kz8c (ro)
Conditions:
  Type                       Status
  PodReadyToStartContainers  False
  Initialized                True
  Ready                      False
  ContainersReady            False
  PodScheduled               True
Volumes:
  ...
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/arch=amd64
                             kubernetes.io/os=windows
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/os=windows:NoSchedule
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
                             nvidia.com/gpu=present:NoSchedule
Topology Spread Constraints: kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=dinkydiner-unity-deployment,app.kubernetes.io/name=unity-deployment
Events:
  Type     Reason         Age                  From     Message
  ----     ------         ----                 ----     -------
  Warning  FailedKillPod  3s (x6596 over 24h)  kubelet  error killing pod: failed to "KillPodSandbox" for "dcf6df12-53af-4b8f-8354-57df98d673f2" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to remove network namespace for sandbox \"fac9a792aa7f303f8e9225d493d28335bca9e8445e62b119684fef9b591cf090\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
However, the task is definitely not running:
$ ctr -n k8s.io t ls | grep bc3286c5a71bc3eb73357e51846eb06471fd0b0f23bd46bb6f2ea33ff081ad14
(it's empty)
Would it make more sense to author a descheduler policy that force deletes pods with this error?
Because I use hostprocess containers to deploy Calico, what would be the most sensible way to use Wait-ForCalicoInit?
I meant to stop and remove the kubelet service (if installed from https://github.com/kubernetes-sigs/sig-windows-tools/blob/master/hostprocess/PrepareNode.ps1), then install it by running this script: https://github.com/projectcalico/calico/blob/master/node/windows-packaging/CalicoWindows/kubernetes/install-kube-services.ps1 (with the caveat that it may be outdated; feedback is appreciated if you find issues). All of this needs to be done on the host, as that's where kubelet runs.
Though I'm not sure if that would in fact solve this problem.
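As a rough sketch (assuming the existing service is registered under the name kubelet and that the Calico for Windows files, including install-kube-services.ps1, already live under C:\CalicoWindows on the host), the swap could look something like:

# Sketch only: replace the sig-windows-tools kubelet service with the Calico-managed one.
# Run on the Windows host; the "kubelet" service name and C:\CalicoWindows path are assumptions.

# Stop and remove the kubelet service installed by PrepareNode.ps1.
Stop-Service -Name "kubelet" -ErrorAction SilentlyContinue
sc.exe delete kubelet

# Register the kubelet/kube-proxy services via the Calico-provided script.
# Review the script first; it may expect paths or variables specific to your install.
Set-Location C:\CalicoWindows\kubernetes
.\install-kube-services.ps1

Start-Service -Name "kubelet"

Adjust the service names and paths to match how your node was provisioned.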
Would it make more sense to author a descheduler policy that force deletes pods with this error?
Could you elaborate? My k8s "noobness" might be showing, but is that something that's possible to do via configuration? Or did you mean for us to write such a tool? Just asking to better understand where to begin...
@doctorpangloss a ping about ^
@coutinhop
Could you elaborate?
I have a CronJob that finds pods suffering from this error and force-terminates them:
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-cleanup-job
spec:
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ...
          nodeSelector:
            kubernetes.io/os: linux
          restartPolicy: Never  # required for Job pods
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # calico 3.26 fixes
                  kubectl get pods --all-namespaces -o json | jq -r '
                    .items[] |
                    select(.metadata.deletionTimestamp != null) |
                    select(
                      .status.phase == "Running" or
                      .status.phase == "Failed" or
                      # "Terminating" is not an actual pod phase; use the deletion timestamp age instead
                      ((now - (.metadata.deletionTimestamp | fromdateiso8601)) > 3600)
                    ) |
                    select(
                      (.status.containerStatuses[0].state.terminated.reason == "Error" and .status.containerStatuses[0].state.terminated.exitCode == -1073741510) or
                      (.status.containerStatuses[0].state.terminated.reason == "StartError" and .status.containerStatuses[0].state.terminated.exitCode == 128) or
                      .status.containerStatuses[0].state.waiting.reason == "ContainerCreating"
                    ) |
                    "\(.metadata.namespace) \(.metadata.name)"
                  ' | while read namespace pod; do
                    if kubectl get events --field-selector involvedObject.name=$pod -n $namespace | grep -q 'FailedKillPod.*hcnDeleteNamespace.*The specified request is unsupported'; then
                      echo "Forcefully deleting pod $pod in namespace $namespace due to FailedKillPod"
                      kubectl delete pod $pod -n $namespace --force --grace-period=0
                    elif kubectl get events --field-selector involvedObject.name=$pod -n $namespace | grep -q 'StartError.*failed to create containerd task: failed to create shim task: hcs::CreateComputeSystem.*The endpoint was not found'; then
                      echo "Forcefully deleting pod $pod in namespace $namespace due to StartError"
                      kubectl delete pod $pod -n $namespace --force --grace-period=0
                    # a pod stuck terminating has a deletion timestamp set while its container never leaves ContainerCreating
                    elif [ -n "$(kubectl get pod $pod -n $namespace -o jsonpath='{.metadata.deletionTimestamp}')" ] && [ "$(kubectl get pod $pod -n $namespace -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')" = "ContainerCreating" ]; then
                      echo "Forcefully deleting pod $pod in namespace $namespace due to being stuck in Terminating state"
                      kubectl delete pod $pod -n $namespace --force --grace-period=0
                    fi
                  done
I wish I understood what the error was actually saying or what was going wrong.
Thanks @doctorpangloss! If possible, could you try the procedure I mentioned:
I meant to stop and remove the kubelet service (if installed from https://github.com/kubernetes-sigs/sig-windows-tools/blob/master/hostprocess/PrepareNode.ps1), then install it by running this script: https://github.com/projectcalico/calico/blob/master/node/windows-packaging/CalicoWindows/kubernetes/install-kube-services.ps1 (with the caveat that it may be outdated; feedback is appreciated if you find issues). All of this needs to be done on the host, as that's where kubelet runs.
This would at least help us narrow down the root cause, if it fixes things...
In the meantime, I can look into having some similar cleanup added to Calico...
Pods terminating during a regular deployment scale-down on a Windows Calico worker get stuck on an unusual error.
Expected Behavior
The pods should cleanly terminate. They sometimes do.
Current Behavior
The pods get stuck in Terminating even though the underlying process appears to have successfully exited.
Possible Solution
Not sure.
Steps to Reproduce (for bugs)
Challenging to reproduce. One note is that I run the core process as a Windows service inside the Windows container; this may be contributing to a race condition.
Context
I experience a lot of issues using containers on Windows, even as a sophisticated user, so I am not sure whether Calico is strictly to blame here.
Your Environment
(1.22 on AWS EKS + on-premises nodes)