kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io

kube-apiserver pod stuck in "CreateContainerError" status #9879

Closed mslga closed 10 months ago

mslga commented 11 months ago

What steps did you take and what happened?

After deploying a cluster with Cluster API Provider vSphere, the kube-apiserver pod is stuck in "CreateContainerError" status.

kubectl get po -A

NAMESPACE     NAME                                   READY   STATUS                 RESTARTS      AGE
kube-system   coredns-5dd5756b68-57dz9               0/1     Pending                0             28m
kube-system   coredns-5dd5756b68-nppbl               0/1     Pending                0             28m
kube-system   etcd-kiv-cp-v48fw                      1/1     Running                1 (28m ago)   28m
kube-system   kube-apiserver-kiv-cp-v48fw            0/1     CreateContainerError   0             28m
kube-system   kube-controller-manager-kiv-cp-v48fw   1/1     Running                1 (28m ago)   28m
kube-system   kube-proxy-8l65v                       1/1     Running                0             28m
kube-system   kube-scheduler-kiv-cp-v48fw            1/1     Running                1 (28m ago)   27m
kube-system   kube-vip-kiv-cp-v48fw                  1/1     Running                0             28m

The API server endpoint is available, and there are two kube-apiserver containers on the control plane node: one in Running status and one in Exited status.

crictl ps -a

CONTAINER           IMAGE               CREATED                  STATE               NAME                      ATTEMPT             POD ID              POD
dc58c937225c1       bb5e0dde9054c       Less than a second ago   Exited              kube-apiserver            0                   f5de5d0666823       kube-apiserver-kiv-cp-v48fw
...
7c54f31e3fc8b       bb5e0dde9054c       28 minutes ago           Running             kube-apiserver   

Containerd logs

Dec 14 16:02:37 kiv-cp-v48fw containerd[4977]: time="2023-12-14T16:02:37.302014512+06:00" level=info msg="CreateContainer within sandbox \"f5de5d066682354fa0679cc95a39a5285657f0eb43237f26194426d8fad3636d\" for container &ContainerMetadata{Name:kube-apiserver,Attempt:1,}"
Dec 14 16:02:37 kiv-cp-v48fw containerd[4977]: time="2023-12-14T16:02:37.302725597+06:00" level=error msg="CreateContainer within sandbox \"f5de5d066682354fa0679cc95a39a5285657f0eb43237f26194426d8fad3636d\" for &ContainerMetadata{Name:kube-apiserver,Attempt:1,} failed" error="failed to reserve container name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\": name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\" is reserved for \"7c54f31e3fc8b87a115f11a214efe060c94e93ae822c9819a472bfb382c2839c\""
Dec 14 16:02:48 kiv-cp-v48fw containerd[4977]: time="2023-12-14T16:02:48.304873160+06:00" level=info msg="CreateContainer within sandbox \"f5de5d066682354fa0679cc95a39a5285657f0eb43237f26194426d8fad3636d\" for container &ContainerMetadata{Name:kube-apiserver,Attempt:1,}"
Dec 14 16:02:48 kiv-cp-v48fw containerd[4977]: time="2023-12-14T16:02:48.305393178+06:00" level=error msg="CreateContainer within sandbox \"f5de5d066682354fa0679cc95a39a5285657f0eb43237f26194426d8fad3636d\" for &ContainerMetadata{Name:kube-apiserver,Attempt:1,} failed" error="failed to reserve container name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\": name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\" is reserved for \"7c54f31e3fc8b87a115f11a214efe060c94e93ae822c9819a472bfb382c2839c\""
Dec 14 16:03:02 kiv-cp-v48fw containerd[4977]: time="2023-12-14T16:03:02.304183604+06:00" level=info msg="CreateContainer within sandbox \"f5de5d066682354fa0679cc95a39a5285657f0eb43237f26194426d8fad3636d\" for container &ContainerMetadata{Name:kube-apiserver,Attempt:1,}"
Dec 14 16:03:02 kiv-cp-v48fw containerd[4977]: time="2023-12-14T16:03:02.305103998+06:00" level=error msg="CreateContainer within sandbox \"f5de5d066682354fa0679cc95a39a5285657f0eb43237f26194426d8fad3636d\" for &ContainerMetadata{Name:kube-apiserver,Attempt:1,} failed" error="failed to reserve container name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\": name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\" is reserved for \"7c54f31e3fc8b87a115f11a214efe060c94e93ae822c9819a472bfb382c2839c\""

Kubelet logs

Dec 14 17:08:47 kiv-cp-v48fw kubelet[5021]: I1214 17:08:47.296285    5021 scope.go:117] "RemoveContainer" containerID="dc58c937225c1d7ae8542004de00a7bd266aad5e7c9b54dddc77f1aedb0c8035"
Dec 14 17:08:47 kiv-cp-v48fw kubelet[5021]: E1214 17:08:47.303868    5021 remote_runtime.go:319] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to reserve container name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\": name \"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\" is reserved for \"7c54f31e3fc8b87a115f11a214efe060c94e93ae822c9819a472bfb382c2839c\"" podSandboxID="f5de5d066682354fa0679cc95a39a5285657f0eb43237f26194426d8fad3636d"
Dec 14 17:08:47 kiv-cp-v48fw kubelet[5021]: E1214 17:08:47.304101    5021 kuberuntime_manager.go:1209] container &Container{Name:kube-apiserver,Image:registry.k8s.io/kube-apiserver:v1.28.0,Command:[kube-apiserver --advertise-address=X.X.X.X --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --cloud-provider=external --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-account-signing-key-file=/etc/kubernetes/pki/sa.key --service-cluster-ip-range=10.96.0.0/12 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{250 -3} {<nil>} 250m DecimalSI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:ca-certs,ReadOnly:true,MountPath:/etc/ssl/certs,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:etc-ca-certificates,ReadOnly:true,MountPath:/etc/ca-certificates,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:etc-pki,ReadOnly:true,MountPath:/etc/pki,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:k8s-certs,ReadOnly:true,MountPath:/etc/kubernetes/pki,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:usr-local-share-ca-certificates,ReadOnly:true,MountPath:/usr/local/share/ca-certificates,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:usr-share-ca-certificates,ReadOnly:true,MountPath:/usr/share/ca-certificates,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/livez,Port:{0 6443 },Host:X.X.X.X,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:10,TimeoutSeconds:15,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:8,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/readyz,Port:{0 6443 
},Host:X.X.X.X,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:15,PeriodSeconds:1,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/livez,Port:{0 6443 },Host:X.X.X.X,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:10,TimeoutSeconds:15,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:24,TerminationGracePeriodSeconds:nil,},ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod kube-apiserver-kiv-cp-v48fw_kube-system(37b2b42d18b011bd117353a3847450aa): CreateContainerError: failed to reserve container name "kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1": name "kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1" is reserved for "7c54f31e3fc8b87a115f11a214efe060c94e93ae822c9819a472bfb382c2839c"
Dec 14 17:08:47 kiv-cp-v48fw kubelet[5021]: E1214 17:08:47.304172    5021 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CreateContainerError: \"failed to reserve container name \\\"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\\\": name \\\"kube-apiserver_kube-apiserver-kiv-cp-v48fw_kube-system_37b2b42d18b011bd117353a3847450aa_1\\\" is reserved for \\\"7c54f31e3fc8b87a115f11a214efe060c94e93ae822c9819a472bfb382c2839c\\\"\"" pod="kube-system/kube-apiserver-kiv-cp-v48fw" podUID="37b2b42d18b011bd117353a3847450aa"

What did you expect to happen?

I assume that under normal behavior the container should be deleted, but it seems to be stuck in Exited status because a new container that uses the same podSandbox has already been created.

I only found a workaround: SSH to the node and delete the container in Exited status,

or add the cleanup command to postKubeadmCommands (a sketch of where it fits is shown after the command):

crictl rm $(crictl ps -a | awk '/kube-apiserver/ && /Exited/ {print $1}')
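
For reference, a minimal sketch of where that cleanup command could sit in a KubeadmControlPlane manifest; the resource name is a placeholder and the "|| true" guard is my addition so the command does not fail when there is nothing to remove:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: kiv-control-plane            # placeholder name
spec:
  kubeadmConfigSpec:
    postKubeadmCommands:
      # remove the leftover Exited kube-apiserver container once kubeadm has finished
      - crictl rm $(crictl ps -a | awk '/kube-apiserver/ && /Exited/ {print $1}') || true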

Could this be a bug or is there already a solution to this problem?

Cluster API version

v1.5.3

Kubernetes version

v1.28.0

Anything else you would like to add?

CAPV v1.8.4

Label(s) to be applied

/kind bug

k8s-ci-robot commented 11 months ago

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
chrischdi commented 10 months ago

Is this a reproducible issue or was this a one-time hit?

Also: it looks like it's not specific to CAPI (maybe not even to CAPV), but rather a bug in the image that is used, which comes back to image-builder.

mslga commented 10 months ago

I ran additional tests on vSphere cluster "A" (the one that originally encountered the error). I used different OVAs that I downloaded from https://github.com/kubernetes-sigs/cluster-api-provider-vsphere?tab=readme-ov-file#kubernetes-versions-with-published-ovas

ubuntu-2204-kube-v1.27.3
ubuntu-2204-kube-v1.28.0

In both cases this error was present, and it only affects the kube-apiserver pod. If you continue installing the necessary controllers and plugins into the cluster, the error does not occur for other pods.

I then uploaded these images to another vSphere cluster "B", and the error did not occur there. When the control plane node was created, the kube-apiserver pod was not even recreated:

crictl ps -a

CONTAINER           IMAGE               CREATED              STATE               NAME                      ATTEMPT             POD ID              POD
1320f4110e7f2       73deb9a3f7025       2 seconds ago        Running             etcd                      1                   89d26eb0056c8       etcd-kiv-cp-fsvvd
61d7e97aeb35d       ed5bba5d71b95       44 seconds ago       Running             kube-vip                  1                   3b82d5b60e1d9       kube-vip-kiv-cp-fsvvd
5aaa71e16620b       4be79c38a4bab       56 seconds ago       Running             kube-controller-manager   0                   1cd72b77449dc       kube-controller-manager-kiv-cp-fsvvd
5c99d55d8feb7       f6f496300a2ae       56 seconds ago       Running             kube-scheduler            0                   46ff881f080d7       kube-scheduler-kiv-cp-fsvvd
f8524cc86d0aa       73deb9a3f7025       56 seconds ago       Exited              etcd                      0                   89d26eb0056c8       etcd-kiv-cp-fsvvd
97475565488d5       ea1030da44aa1       About a minute ago   Running             kube-proxy                0                   13191d32e2c8c       kube-proxy-8rnkk
f72d4a681a5cb       ed5bba5d71b95       About a minute ago   Exited              kube-vip                  0                   3b82d5b60e1d9       kube-vip-kiv-cp-fsvvd
001dd5a308c9f       bb5e0dde9054c       About a minute ago   Running             kube-apiserver            0                   381cd2a560ab0       kube-apiserver-kiv-cp-fsvvd
3ed3ce6566fb3       f6f496300a2ae       2 minutes ago        Exited              kube-scheduler            0                   ac1ae0d29ea97       kube-scheduler-kiv-cp-fsvvd

It turns out that the problem may be related to the vSphere cluster itself, but I don't quite understand what exactly to check or where to look for the cause.

There is no load on vSphere cluster "A"; it is a new cluster.

vSphere cluster "A" version: 8.0.1.00200 Build number: 21860503

vSphere cluster "B" version: 8.0.1.00200 Build number: 21860503

mslga commented 10 months ago

Problem solved.

It turned out that NTP was not configured on the ESXi hosts of cluster "A". The virtual machines received their time from the ESXi hosts when they were created, and only afterwards did Cluster API apply the NTP settings inside the virtual machines. The resulting time offset caused the kube-apiserver container to be restarted.
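
For anyone hitting the same problem, a minimal sketch of also pinning NTP in the bootstrap config so the guests get correct time from first boot; the NTP server addresses are placeholders, and this only covers the guest side, the ESXi hosts still need NTP configured:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    ntp:
      enabled: true
      servers:                       # placeholder NTP servers
        - 0.pool.ntp.org
        - 1.pool.ntp.org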