kubernetes / kubeadm

Aggregator for issues filed against kubeadm

while upgrading to 1.25.4 kubelet does not start the new etcd pod #2789

Closed raz3k closed 1 year ago

raz3k commented 1 year ago

### What happened?

Hello,

We are using a single, untainted node running Kubernetes 1.24.6 with some services installed, and we are trying to upgrade it to 1.25.4. During the upgrade the new etcd pod does not start; after timing out, kubeadm puts back the old etcd manifest, but etcd still does not come back up. It only comes back up after a kubelet restart.

[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to "v1.25.4"
[upgrade/versions] Cluster version: v1.24.6
[upgrade/versions] kubeadm version: v1.25.4
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.25.4" (timeout: 5m0s)...
[upgrade/etcd] Upgrading to TLS for etcd
[upgrade/staticpods] Preparing for "etcd" upgrade
[upgrade/staticpods] Renewing etcd-server certificate
[upgrade/staticpods] Renewing etcd-peer certificate
[upgrade/staticpods] Renewing etcd-healthcheck-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-12-06-10-26-30/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 1 Pods for label selector component=etcd
[upgrade/staticpods] Component "etcd" upgraded successfully!
[upgrade/etcd] Waiting for etcd to become available
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests3426337766"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Renewing apiserver-etcd-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-12-06-10-26-30/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: failed to obtain static Pod hash for component kube-apiserver on Node master-worker-973aea025fb4e6b759d0: Get "https://10.10.10.205:10443/api/v1/namespaces/kube-system/pods/kube-apiserver-master-worker-973aea025fb4e6b759d0?timeout=10s": dial tcp 10.10.10.205:10443: connect: connection refused
To see the stack trace of this error execute with --v=5 or higher

A kubelet log with verbosity 5 can be found at this link: http://ghitescu.ro/kubelet_logs.log (the upgrade starts at "Dec 06 10:25:22").
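
For reference, the workaround mentioned above looks roughly like this (a sketch; the exact commands are assumptions based on the description, not copied from the node):

```console
# After the failed upgrade, etcd only comes back once the kubelet is restarted
$ systemctl restart kubelet

# With cri-dockerd the control-plane containers are visible via docker ps
$ docker ps --filter name=etcd
```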

### What did you expect to happen?

The Kubernetes upgrade to succeed.

### How can we reproduce it (as minimally and precisely as possible)?

Install Kubernetes 1.24.6 on a single untainted node and add some services such as minio, elk, postgres, ingress, kafka, grafana, rabbitmq, redis, spark, zookeeper (you need to add services on it, otherwise the upgrade succeeds).
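
A rough sketch of the above, assuming a stock kubeadm single-node install (the exact init flags and workload manifests are assumptions, not taken from the report):

```console
# Single control-plane node on v1.24.6, then remove the control-plane taints so workloads schedule on it
$ kubeadm init --kubernetes-version v1.24.6
$ kubectl taint nodes --all node-role.kubernetes.io/control-plane- node-role.kubernetes.io/master-

# Deploy a number of services (minio, elk, postgres, kafka, ...) and then attempt the upgrade
$ kubeadm upgrade plan
$ kubeadm upgrade apply v1.25.4
```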

### Anything else we need to know?

I have the PodSecurity feature gate enabled, etcd encryption enabled, and I'm using Docker with cri-dockerd.
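
For context, a minimal sketch of where those settings typically live on a kubeadm node; the paths and values below are illustrative assumptions, not taken from the affected cluster:

```console
# PodSecurity / encryption show up as kube-apiserver flags in the static pod manifest
$ grep -E 'feature-gates|encryption-provider-config' /etc/kubernetes/manifests/kube-apiserver.yaml
    - --feature-gates=PodSecurity=true
    - --encryption-provider-config=/etc/kubernetes/pki/encryption-config.yaml

# The kubelet talks to Docker through the cri-dockerd socket
$ grep container-runtime-endpoint /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--container-runtime-endpoint=unix:///var/run/cri-dockerd.sock ..."
```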

### Kubernetes version

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"b39bf148cd654599a52e867485c02c4f9d28b312", GitTreeState:"clean", BuildDate:"2022-09-21T13:19:24Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"b39bf148cd654599a52e867485c02c4f9d28b312", GitTreeState:"clean", BuildDate:"2022-09-21T13:12:04Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
```

### Cloud provider

Local install on OpenStack

### OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.6"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.6"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:6:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://bugzilla.oracle.com/"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.6
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.6

$ uname -a
Linux razvan-test-y.novalocal 4.18.0-372.32.1.0.1.el8_6.x86_64 #1 SMP Tue Oct 25 10:51:38 PDT 2022 x86_64 x86_64 x86_64 GNU/Linux
```

### Install tools

kubeadm

### Container runtime (CRI) and version (if applicable)

cri-dockerd 0.2.6 (d8accf7)

### Related plugins (CNI, CSI, ...) and versions (if applicable)

flannel 0.17.0
k8s-ci-robot commented 1 year ago

@raz3k: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
k8s-ci-robot commented 1 year ago

@raz3k: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
neolit123 commented 1 year ago

/transfer kubeadm
/kind support

hi, any logs on why etcd is not starting? we are not seeing such a problem in our upgrade CI tests

neolit123 commented 1 year ago

> I have the PodSecurity feature gate enabled

maybe that's what's blocking it? try checking logs from all k8s components like the kcm
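
One way to do that on a docker / cri-dockerd node (a sketch; container names and time windows are assumptions):

```console
# Control-plane components run as containers, so their logs are reachable via docker
$ docker ps --filter name=k8s_kube-controller-manager --format '{{.ID}}  {{.Names}}'
$ docker logs --since 15m <container-id>

# The kubelet itself logs to the systemd journal
$ journalctl -u kubelet --since "2022-12-06 10:25" --no-pager
```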

raz3k commented 1 year ago

The etcd container is not getting created at all; I'm not sure why, but I can't see any container named etcd in docker ps.

kcm logs are as follows:

I1206 10:26:44.751167       1 serving.go:348] Generated self-signed cert in-memory
I1206 10:26:45.059320       1 controllermanager.go:180] Version: v1.24.6
I1206 10:26:45.059337       1 controllermanager.go:182] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1206 10:26:45.059751       1 dynamic_cafile_content.go:157] "Starting controller" name="request-header::/etc/kubernetes/pki/front-proxy-ca.crt"
I1206 10:26:45.059773       1 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
I1206 10:26:45.060128       1 secure_serving.go:210] Serving securely on 127.0.0.1:10257
I1206 10:26:45.060198       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1206 10:26:45.060861       1 leaderelection.go:248] attempting to acquire leader lease kube-system/kube-controller-manager...
E1206 10:26:50.061848       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1206 10:26:58.313042       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E1206 10:27:06.648563       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1206 10:27:15.448956       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1206 10:27:24.474133       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1206 10:27:43.806917       1 leaderelection.go:258] successfully acquired lease kube-system/kube-controller-manager
I1206 10:27:43.806977       1 event.go:294] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="razvan-test-y.novalocal_d2210695-811b-4fe6-9256-b283236fff5c became leader"
I1206 10:27:44.359784       1 shared_informer.go:255] Waiting for caches to sync for tokens
I1206 10:27:44.361268       1 endpoints_controller.go:178] Starting endpoint controller
I1206 10:27:44.361288       1 shared_informer.go:255] Waiting for caches to sync for endpoint
I1206 10:27:44.367182       1 controllermanager.go:593] Started "endpoint"
I1206 10:27:44.369311       1 controllermanager.go:593] Started "csrcleaner"
I1206 10:27:44.369388       1 cleaner.go:82] Starting CSR cleaner controller
I1206 10:27:44.372715       1 controllermanager.go:593] Started "tokencleaner"
I1206 10:27:44.372838       1 tokencleaner.go:118] Starting token cleaner controller
I1206 10:27:44.372845       1 shared_informer.go:255] Waiting for caches to sync for token_cleaner
I1206 10:27:44.372852       1 shared_informer.go:262] Caches are synced for token_cleaner
I1206 10:27:44.377328       1 node_lifecycle_controller.go:77] Sending events to api server
E1206 10:27:44.377349       1 core.go:211] failed to start cloud node lifecycle controller: no cloud provider provided
W1206 10:27:44.377356       1 controllermanager.go:571] Skipping "cloud-node-lifecycle"
I1206 10:27:44.378515       1 controllermanager.go:593] Started "clusterrole-aggregation"
I1206 10:27:44.378585       1 clusterroleaggregation_controller.go:194] Starting ClusterRoleAggregator
I1206 10:27:44.378591       1 shared_informer.go:255] Waiting for caches to sync for ClusterRoleAggregator
I1206 10:27:44.379970       1 controllermanager.go:593] Started "ttl-after-finished"
I1206 10:27:44.380029       1 ttlafterfinished_controller.go:109] Starting TTL after finished controller
I1206 10:27:44.380038       1 shared_informer.go:255] Waiting for caches to sync for TTL after finished
I1206 10:27:44.381161       1 controllermanager.go:593] Started "replicationcontroller"
I1206 10:27:44.381230       1 replica_set.go:205] Starting replicationcontroller controller
I1206 10:27:44.381235       1 shared_informer.go:255] Waiting for caches to sync for ReplicationController
I1206 10:27:44.382250       1 controllermanager.go:593] Started "podgc"
I1206 10:27:44.382340       1 gc_controller.go:92] Starting GC controller
I1206 10:27:44.382346       1 shared_informer.go:255] Waiting for caches to sync for GC
I1206 10:27:44.397113       1 controllermanager.go:593] Started "namespace"
I1206 10:27:44.397193       1 namespace_controller.go:200] Starting namespace controller
I1206 10:27:44.397199       1 shared_informer.go:255] Waiting for caches to sync for namespace
I1206 10:27:44.398694       1 controllermanager.go:593] Started "serviceaccount"
I1206 10:27:44.398800       1 serviceaccounts_controller.go:117] Starting service account controller
I1206 10:27:44.398806       1 shared_informer.go:255] Waiting for caches to sync for service account
I1206 10:27:44.401763       1 controllermanager.go:593] Started "garbagecollector"
I1206 10:27:44.401840       1 garbagecollector.go:149] Starting garbage collector controller
I1206 10:27:44.401855       1 shared_informer.go:255] Waiting for caches to sync for garbage collector
I1206 10:27:44.401889       1 graph_builder.go:289] GraphBuilder running
I1206 10:27:44.403919       1 controllermanager.go:593] Started "disruption"
I1206 10:27:44.404003       1 disruption.go:363] Starting disruption controller
I1206 10:27:44.404014       1 shared_informer.go:255] Waiting for caches to sync for disruption
I1206 10:27:44.405017       1 controllermanager.go:593] Started "statefulset"
I1206 10:27:44.405109       1 stateful_set.go:147] Starting stateful set controller
I1206 10:27:44.405115       1 shared_informer.go:255] Waiting for caches to sync for stateful set
I1206 10:27:44.405989       1 controllermanager.go:593] Started "pv-protection"
I1206 10:27:44.406064       1 pv_protection_controller.go:79] Starting PV protection controller
I1206 10:27:44.406070       1 shared_informer.go:255] Waiting for caches to sync for PV protection
I1206 10:27:44.425219       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for jobs.batch
I1206 10:27:44.425249       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for poddisruptionbudgets.policy
I1206 10:27:44.425273       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for applicationstatuses.nci.viavisolutions.com
I1206 10:27:44.425315       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for rolebindings.rbac.authorization.k8s.io
I1206 10:27:44.425333       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for limitranges
I1206 10:27:44.425351       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for replicasets.apps
I1206 10:27:44.425374       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for networkpolicies.networking.k8s.io
I1206 10:27:44.425390       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for podtemplates
I1206 10:27:44.425407       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for ingresses.networking.k8s.io
I1206 10:27:44.425434       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for events.events.k8s.io
I1206 10:27:44.425451       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for applicationinterfaces.nci.viavisolutions.com
W1206 10:27:44.425462       1 shared_informer.go:533] resyncPeriod 21h20m53.927800953s is smaller than resyncCheckPeriod 22h50m50.749923827s and the informer has already started. Changing it to 22h50m50.749923827s
I1206 10:27:44.425569       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for kafkatopics.kafka.strimzi.io
I1206 10:27:44.425592       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for buckets.minio.nci.viavisolutions.com
I1206 10:27:44.425612       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for endpointslices.discovery.k8s.io
I1206 10:27:44.425626       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for elasticcurators.nci.viavisolutions.com
I1206 10:27:44.425638       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for deployments.apps
I1206 10:27:44.425657       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for csistoragecapacities.storage.k8s.io
I1206 10:27:44.425671       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for leases.coordination.k8s.io
I1206 10:27:44.425686       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for kibanadashboards.nci.viavisolutions.com
I1206 10:27:44.425702       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for instances.vertica.nci.viavisolutions.com
I1206 10:27:44.425716       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for kafkastreams.vertica.nci.viavisolutions.com
I1206 10:27:44.425734       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for endpoints
I1206 10:27:44.425754       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for sparkapplications.sparkoperator.k8s.io
I1206 10:27:44.425771       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for scheduledsparkapplications.sparkoperator.k8s.io
I1206 10:27:44.425791       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for statefulsets.apps
I1206 10:27:44.425811       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for cronjobs.batch
I1206 10:27:44.425826       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for georedundancyactions.georedundancy.nci.viavisolutions.com
I1206 10:27:44.425856       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for daemonsets.apps
I1206 10:27:44.425870       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for horizontalpodautoscalers.autoscaling
I1206 10:27:44.425887       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for roles.rbac.authorization.k8s.io
I1206 10:27:44.425899       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for georedundancystates.georedundancy.nci.viavisolutions.com
W1206 10:27:44.425907       1 shared_informer.go:533] resyncPeriod 20h28m32.85967644s is smaller than resyncCheckPeriod 22h50m50.749923827s and the informer has already started. Changing it to 22h50m50.749923827s
I1206 10:27:44.425966       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for serviceaccounts
I1206 10:27:44.425979       1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for controllerrevisions.apps
I1206 10:27:44.425992       1 controllermanager.go:593] Started "resourcequota"
I1206 10:27:44.426023       1 resource_quota_controller.go:273] Starting resource quota controller
I1206 10:27:44.426034       1 shared_informer.go:255] Waiting for caches to sync for resource quota
I1206 10:27:44.426046       1 resource_quota_monitor.go:308] QuotaMonitor running
I1206 10:27:44.427181       1 node_ipam_controller.go:91] Sending events to api server.
I1206 10:27:44.460508       1 shared_informer.go:262] Caches are synced for tokens
E1206 10:27:47.826889       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.10.205:10443: connect: connection refused
E1206 10:27:49.827768       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.10.205:10443: connect: connection refused
E1206 10:27:51.827531       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.10.205:10443: connect: connection refused
E1206 10:27:53.828133       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.10.205:10443: connect: connection refused
W1206 10:27:54.428208       1 client_builder_dynamic.go:197] get or create service account failed: Get "https://10.10.10.205:10443/api/v1/namespaces/kube-system/serviceaccounts/node-controller": dial tcp 10.10.10.205:10443: connect: connection refused
W1206 10:27:54.929605       1 client_builder_dynamic.go:197] get or create service account failed: Get "https://10.10.10.205:10443/api/v1/namespaces/kube-system/serviceaccounts/node-controller": dial tcp 10.10.10.205:10443: connect: connection refused
E1206 10:27:55.827843       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.10.205:10443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.10.205:10443: connect: connection refused
W1206 10:27:55.931245       1 client_builder_dynamic.go:197] get or create service account failed: Get "https://10.10.10.205:10443/api/v1/namespaces/kube-system/serviceaccounts/node-controller": dial tcp 10.10.10.205:10443: connect: connection refused
E1206 10:27:57.827523       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
I1206 10:27:57.827606       1 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
E1206 10:27:57.827650       1 controllermanager.go:284] "leaderelection lost"
E1206 10:27:57.827668       1 shared_informer.go:258] unable to sync caches for ReplicationController
I1206 10:27:57.827689       1 replica_set.go:209] Shutting down replicationcontroller controller
neolit123 commented 1 year ago

> docker ps
>
> cri-dockerd 0.2.6

as you probably know, 1.24 dropped built-in Docker (dockershim) support: https://kubernetes.io/blog/2022/02/17/dockershim-faq/. unfortunately cri-dockerd is not recommended or well tested by k8s as a whole. from my pov it's not a well maintained project, so you could be hitting incompatibilities.

minikube has some docker tests, but vanilla kubeadm doesn't

sounds like a kubelet / container runtime problem and not a kubeadm one. try searching for etcd in kubelet logs.
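
For example, something along these lines (a sketch, not an exact recipe):

```console
# Look for etcd-related events in the kubelet journal around the time of the upgrade
$ journalctl -u kubelet --since "2022-12-06 10:25" --no-pager | grep -i etcd

# Check that the static pod manifest was actually written and whether a container was ever created
$ ls -l /etc/kubernetes/manifests/etcd.yaml
$ docker ps -a --filter name=etcd
```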

neolit123 commented 1 year ago

note- testing our support bot command

/support

github-actions[bot] commented 1 year ago

Hello, @raz3k :robot: :wave:

You seem to have troubles using Kubernetes and kubeadm. Note that our issue trackers should not be used for providing support to users. There are special channels for that purpose.

Please see: