openyurtio / openyurt

OpenYurt - Extending your native Kubernetes to edge (project under CNCF)
https://openyurt.io
Apache License 2.0

[BUG]could not upgrade static pod, timeout waiting for static pod kube-system/yurt-hub-xxx to be running #1986

Closed Ssspade closed 4 months ago

Ssspade commented 5 months ago

All other Pods are running; only yss-upgrade-worker-yurt-hub is failing.

default nginx-node2-7cb6c5fdbc-7zzq6 1/1 Running 0 32m
kube-flannel kube-flannel-ds-cj8z2 1/1 Running 0 32m
kube-system coredns-wbpwh 1/1 Running 0 32m
kube-system kube-proxy-86g8d 1/1 Running 0 32m
kube-system raven-agent-ds-5qhxk 1/1 Running 2 (52m ago) 98m
kube-system yss-upgrade-worker-yurt-hub-node2-65d4cdf959 0/1 Error 0
kube-system yurt-hub-node2 1/1 Running 0 31m


kube-system yss-upgrade-worker-yurt-hub-node2-65d4cdf959 0/1 Error 0

logs:

I0319 09:28:14.178337   17976 upgrade.go:35] FLAG: --hash="65d4cdf959"
I0319 09:28:14.178637   17976 upgrade.go:35] FLAG: --help="false"
I0319 09:28:14.178662   17976 upgrade.go:35] FLAG: --kubeconfig=""
I0319 09:28:14.178673   17976 upgrade.go:35] FLAG: --manifest="yurthub"
I0319 09:28:14.178681   17976 upgrade.go:35] FLAG: --mode="AdvancedRollingUpdate"
I0319 09:28:14.178690   17976 upgrade.go:35] FLAG: --name="yurt-hub-node2"
I0319 09:28:14.178698   17976 upgrade.go:35] FLAG: --namespace="kube-system"
I0319 09:28:14.178706   17976 upgrade.go:35] FLAG: --timeout="2m0s"
I0319 09:28:14.179044   17976 upgrade.go:88] Create upgrade space success
I0319 09:28:14.179247   17976 upgrade.go:106] Auto prepare upgrade manifest success
I0319 09:28:14.179395   17976 upgrade.go:112] Auto upgrade backupManifest success
I0319 09:28:14.179672   17976 upgrade.go:118] Auto upgrade replaceManifest success
I0319 09:28:14.179721   17976 util.go:75] WaitForPodRunning namespace is kube-system, name is yurt-hub-node2
F0319 09:30:14.652184   17976 upgrade.go:48] could not upgrade static pod, timeout waiting for static pod kube-system/yurt-hub-node2 to be running
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
        /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:1026 +0x90
k8s.io/klog/v2.(*loggingT).output(0x1be27e0, 0x3, {0x0, 0x0}, 0x400024c930, 0x0, {0x14bebdd?, 0x40001140f0?}, 0x0?, 0x0)
        /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:975 +0x5f4
k8s.io/klog/v2.(*loggingT).printf(0x400024fb80?, 0x109bbf0?, {0x0, 0x0}, {0x0, 0x0}, {0xfe1316, 0x20}, {0x40001140f0, 0x1, ...})
        /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:753 +0x184
k8s.io/klog/v2.Fatalf(...)
        /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:1514
github.com/openyurtio/openyurt/cmd/yurt-node-servant/static-pod-upgrade.NewUpgradeCmd.func1(0x400007ef00?, {0x4000516690?, 0x5?, 0x5?})
        /build/cmd/yurt-node-servant/static-pod-upgrade/upgrade.go:48 +0x258
github.com/spf13/cobra.(*Command).execute(0x400007ef00, {0x4000516640, 0x5, 0x5})
        /go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:944 +0x5f4
github.com/spf13/cobra.(*Command).ExecuteC(0x400007e000)
        /go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x368
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992
main.main()
        /build/cmd/yurt-node-servant/node-servant.go:53 +0x2dc

goroutine 34 [chan receive, 2 minutes]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0?)
        /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:1169 +0x60
created by k8s.io/klog/v2.init.0
        /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:420 +0x150
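From the flags above, the upgrade worker backs up and replaces the static pod manifest and then waits up to --timeout=2m0s for kube-system/yurt-hub-node2 to become Running before giving up. A rough shell equivalent of that final wait step (illustrative only, not the actual node-servant code) would be:

# poll the static pod for the same 2-minute window as --timeout=2m0s
for i in $(seq 1 24); do
  phase=$(kubectl get pod yurt-hub-node2 -n kube-system -o jsonpath='{.status.phase}' 2>/dev/null)
  [ "$phase" = "Running" ] && echo "yurt-hub-node2 is Running" && break
  sleep 5
done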

Environment:

others

/kind bug

rambohe-ch commented 5 months ago

@Ssspade Thanks for raising the issue.

I0319 09:28:14.179044   17976 upgrade.go:88] Create upgrade space success
I0319 09:28:14.179247   17976 upgrade.go:106] Auto prepare upgrade manifest success
I0319 09:28:14.179395   17976 upgrade.go:112] Auto upgrade backupManifest success
I0319 09:28:14.179672   17976 upgrade.go:118] Auto upgrade replaceManifest success
I0319 09:28:14.179721   17976 util.go:75] WaitForPodRunning namespace is kube-system, name is yurt-hub-node2
F0319 09:30:14.652184   17976 upgrade.go:48] could not upgrade static pod, timeout waiting for static pod kube-system/yurt-hub-node2 to be running

As the above logs show, the manifest file of the YurtHub static pod has been replaced, but the YurtHub static pod failed to restart, so would you like to check the following details:

Ssspade commented 5 months ago

Thanks for your attention. I would like to ask: what is the purpose of "yss-upgrade-worker-yurt-hub-xxx"?

When I first modified the yurthub.yaml file and placed it in /etc/kubernetes/manifests/, the pod was created and started running. However, because of incorrect modifications in my yurthub.yaml file, yurt-hub-node2 could not run properly. So I deleted the yurthub.yaml file, fixed it, and placed it back in /etc/kubernetes/manifests/. After doing so, yurt-hub-node2 ran successfully, but "yss-upgrade-worker-" encountered an error.

I manually deleted the "yss-upgrade-worker-" pod to restart it, but when checking the kubelet logs I did not find anything relevant, neither on the node nor on the master.
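For reference, a minimal way to look for static-pod related events on the edge node (assuming the kubelet runs as a systemd unit and uses the default manifest directory) is something like:

# search the kubelet journal for yurt-hub / static pod events
journalctl -u kubelet --since "1 hour ago" | grep -iE "yurt-hub|static"
# confirm which manifests the kubelet is currently watching
ls -l /etc/kubernetes/manifests/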

Ssspade commented 5 months ago

Sorry, my previous reproduction method was a bit flawed. I deleted yurthub.yaml and recopied it to the directory /etc/kubernetes/manifests/. The logs for yurt-hub-xxx are as follows:

I0320 09:32:53.103470       1 util.go:289] start proxying: get /apis/apps.openyurt.io/v1beta1/nodepools?limit=500&resourceVersion=0, in flight requests: 1
I0320 09:32:53.103765       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca-bundle::/var/lib/yurthub/pki/ca.crt" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2024-03-12 08:43:04 +0000 UTC to 2034-03-10 08:43:04 +0000 UTC (now=2024-03-20 09:32:53.10367724 +0000 UTC))"
I0320 09:32:53.104143       1 util.go:289] start proxying: get /api/v1/services?limit=500&resourceVersion=0, in flight requests: 2
I0320 09:32:53.104455       1 tlsconfig.go:200] "Loaded serving cert" certName="serving-cert::/var/lib/yurthub/pki/yurthub-server-current.pem::/var/lib/yurthub/pki/yurthub-server-current.pem" certDetail="\"system:node:node1\" [serving] groups=[system:nodes] validServingFor=[169.254.2.1,127.0.0.1] issuer=\"kubernetes\" (2024-03-20 07:46:26 +0000 UTC to 2025-03-20 07:46:26 +0000 UTC (now=2024-03-20 09:32:53.10435406 +0000 UTC))"
I0320 09:32:53.104524       1 secure_serving.go:200] Serving securely on 169.254.2.1:10268
I0320 09:32:53.104601       1 dynamic_serving_content.go:129] "Starting controller" name="serving-cert::/var/lib/yurthub/pki/yurthub-server-current.pem::/var/lib/yurthub/pki/yurthub-server-current.pem"
I0320 09:32:53.104678       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0320 09:32:53.104699       1 dynamic_cafile_content.go:155] "Starting controller" name="client-ca-bundle::/var/lib/yurthub/pki/ca.crt"
E0320 09:32:53.105635       1 gc.go:181] could not list keys for kubelet events, specified key is not found
E0320 09:32:53.105669       1 gc.go:181] could not list keys for kube-proxy events, specified key is not found
I0320 09:32:53.106652       1 util.go:289] start proxying: get /api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dyurt-hub-cfg&limit=500&resourceVersion=0, in flight requests: 3
I0320 09:32:53.108302       1 util.go:248] yurthub list configmaps: /api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dyurt-hub-cfg&limit=500&resourceVersion=0 with status code 200, spent 1.50208ms
I0320 09:32:53.110921       1 approver.go:191] current filter setting: map[coredns/endpoints/list:map[servicetopology:{}] coredns/endpoints/watch:map[servicetopology:{}] coredns/endpointslices/list:map[servicetopology:{}] coredns/endpointslices/watch:map[servicetopology:{}] kube-proxy/endpoints/list:map[servicetopology:{}] kube-proxy/endpoints/watch:map[servicetopology:{}] kube-proxy/endpointslices/list:map[servicetopology:{}] kube-proxy/endpointslices/watch:map[servicetopology:{}] kube-proxy/services/list:map[discardcloudservice:{} nodeportisolation:{}] kube-proxy/services/watch:map[discardcloudservice:{} nodeportisolation:{}] kubelet/configmaps/get:map[inclusterconfig:{}] kubelet/configmaps/list:map[inclusterconfig:{}] kubelet/configmaps/watch:map[inclusterconfig:{}] kubelet/pods/list:map[hostnetworkpropagation:{}] kubelet/pods/watch:map[hostnetworkpropagation:{}] kubelet/services/list:map[masterservice:{}] kubelet/services/watch:map[masterservice:{}] nginx-ingress-controller/endpoints/list:map[servicetopology:{}] nginx-ingress-controller/endpoints/watch:map[servicetopology:{}] nginx-ingress-controller/endpointslices/list:map[servicetopology:{}] nginx-ingress-controller/endpointslices/watch:map[servicetopology:{}]] after add
I0320 09:32:53.111522       1 util.go:289] start proxying: get /api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dyurt-hub-cfg&resourceVersion=476411&timeout=5m32s&timeoutSeconds=332&watch=true, in flight requests: 3
I0320 09:32:53.206474       1 util.go:248] yurthub list nodepools: /apis/apps.openyurt.io/v1beta1/nodepools?limit=500&resourceVersion=0 with status code 200, spent 102.901ms
I0320 09:32:53.206547       1 serializer.go:200] schema.GroupVersionResource{Group:"apps.openyurt.io", Version:"v1beta1", Resource:"nodepools"} is not found in client-go runtime scheme
I0320 09:32:53.207393       1 util.go:289] start proxying: get /apis/apps.openyurt.io/v1beta1/nodepools?allowWatchBookmarks=true&resourceVersion=476486&timeoutSeconds=578&watch=true, in flight requests: 3
I0320 09:32:53.208631       1 util.go:248] yurthub list services: /api/v1/services?limit=500&resourceVersion=0 with status code 200, spent 104.39578ms
I0320 09:32:53.212136       1 util.go:289] start proxying: get /api/v1/services?allowWatchBookmarks=true&resourceVersion=476878&timeout=9m56s&timeoutSeconds=596&watch=true, in flight requests: 3
E0320 09:32:53.538807       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:32:53.538887       1 ota.go:66] Get pod list failed, specified key is not found
E0320 09:32:58.194780       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:32:58.194836       1 ota.go:66] Get pod list failed, specified key is not found
E0320 09:33:03.757531       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:33:03.757566       1 ota.go:66] Get pod list failed, specified key is not found
E0320 09:33:08.020620       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:33:08.020678       1 ota.go:66] Get pod list failed, specified key is not found
E0320 09:33:13.837857       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:33:13.837916       1 ota.go:66] Get pod list failed, specified key is not found
E0320 09:33:18.661483       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:33:18.661539       1 ota.go:66] Get pod list failed, specified key is not found
E0320 09:33:23.089416       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:33:23.089471       1 ota.go:66] Get pod list failed, specified key is not found
E0320 09:33:28.518469       1 storage_wrapper.go:141] could not list objects for kubelet/pods.v1.core, specified key is not found
E0320 09:33:28.518503       1 ota.go:66] Get pod list failed, specified key is not found

The Edge node kubelet logs are as follows:

Mar 20 09:33:25 node1 kubelet[393]: E0320 09:33:25.510726     393 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "node1": the object has been modified; please apply your changes to the latest version and try again
Mar 20 09:33:31 node1 kubelet[393]: E0320 09:33:31.201961     393 dns.go:157] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is:  114.114.114.114 fdae:7c79:ff46::1"
Mar 20 09:33:35 node1 kubelet[393]: E0320 09:33:35.859105     393 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "node1": the object has been modified; please apply your changes to the latest version and try again
Mar 20 09:33:45 node1 kubelet[393]: E0320 09:33:45.374661     393 dns.go:157] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 114.114.114.114 fdae:7c79:ff46::1"
Mar 20 09:33:47 node1 kubelet[393]: E0320 09:33:47.522655     393 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "node1": the object has been modified; please apply your changes to the latest version and try again
Mar 20 09:33:57 node1 kubelet[393]: E0320 09:33:57.623901     393 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "node1": the object has been modified; please apply your changes to the latest version and try again
Mar 20 09:34:07 node1 kubelet[393]: E0320 09:34:07.951674     393 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "node1": the object has been modified; please apply your changes to the latest version and try again

The normally running yurt-hub is re-created by the kubelet after yss-upgrade-worker fails.

rambohe-ch commented 5 months ago

@Ssspade It seems that the YurtHub component has not worked correctly. Would you like to show me the full YurtHub logs?

Ssspade commented 5 months ago

During the execution of yss-upgrade, the logs for yurt-hub are recorded in the file named "yurt-hub-during-yss.log".


yurt-hub-during-yss.log

After 2 minutes, if yss-upgrade fails, yurt-hub will restart and its logs will be written to the file named "yurt-hub.log".


yurt-hub.log

rambohe-ch commented 5 months ago

During the execution of yss-upgrade, the logs for yurt-hub are recorded in the file named "yurt-hub-during-yss.log". After 2 minutes, if yss-upgrade fails, yurt-hub will restart and its logs will be written to the file named "yurt-hub.log".

yurt-hub.log

@Ssspade It seems that the yurt-manager component is not installed correctly, because the NodePool CRD is not found in your cluster.

In your log file, there are some errors about the NodePool resource, as follows:

I0326 02:14:43.771606 1 serializer.go:200] schema.GroupVersionResource{Group:"apps.openyurt.io", Version:"v1beta1", Resource:"nodepools"} is not found in client-go runtime scheme

rambohe-ch commented 5 months ago

@Ssspade Would you like to upload the detailed metrics of the yurthub component?

You can get the metrics on the node by executing the command: curl http://127.0.0.1:10267/metrics
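For example, to narrow the output down to the per-client in-flight request counters, something like this can be used:

# show which client components are sending requests through yurthub
curl -s http://127.0.0.1:10267/metrics | grep node_yurthub_in_flight_requests_collector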

Ssspade commented 5 months ago

Of course, I'm very willing to. For the installation process, I referred to the official website (v1.4). curl http://127.0.0.1:10267/metrics log: curl_metric.log

Ssspade commented 5 months ago

During the execution of yss-upgrade, the logs for yurt-hub are recorded in the file named "yurt-hub-during-yss.log". After 2 minutes, if yss-upgrade fails, yurt-hub will restart and its logs will be written to the file named "yurt-hub.log".

@Ssspade It seems that the yurt-manager component is not installed correctly, because the NodePool CRD is not found in your cluster.

In your log file, there are some errors about the NodePool resource, as follows:

I0326 02:14:43.771606 1 serializer.go:200] schema.GroupVersionResource{Group:"apps.openyurt.io", Version:"v1beta1", Resource:"nodepools"} is not found in client-go runtime scheme

I used this command to check it:

root@master:~# kubectl get crd
NAME                                 CREATED AT
gateways.raven.openyurt.io           2024-03-21T09:29:27Z
nodepools.apps.openyurt.io           2024-03-21T09:29:26Z
platformadmins.iot.openyurt.io       2024-03-21T09:29:26Z
yurtappdaemons.apps.openyurt.io      2024-03-21T09:29:26Z
yurtappoverriders.apps.openyurt.io   2024-03-21T09:29:26Z
yurtappsets.apps.openyurt.io         2024-03-21T09:29:26Z
yurtstaticsets.apps.openyurt.io      2024-03-21T09:29:26Z
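Since the serializer message mentions the v1beta1 version specifically, one way to double-check that this version is actually served by the cluster (plain kubectl, nothing OpenYurt-specific) would be:

# list the API resources served for the apps.openyurt.io group
kubectl api-resources --api-group=apps.openyurt.io
# query the v1beta1 NodePool endpoint directly
kubectl get --raw /apis/apps.openyurt.io/v1beta1/nodepools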
rambohe-ch commented 5 months ago

Of course, I'm very willing to. For the installation process, I referred to the official website (v1.4). curl http://127.0.0.1:10267/metrics log: curl_metric.log

@Ssspade From the yurthub metrics, only yurthub itself sends requests to kube-apiserver through yurthub; other components have not sent requests to kube-apiserver through yurthub.

node_yurthub_in_flight_requests_collector{client="yurthub",resource="configmaps",subresources="",verb="list"} 0
node_yurthub_in_flight_requests_collector{client="yurthub",resource="configmaps",subresources="",verb="watch"} 1
node_yurthub_in_flight_requests_collector{client="yurthub",resource="nodepools",subresources="",verb="list"} 0
node_yurthub_in_flight_requests_collector{client="yurthub",resource="nodepools",subresources="",verb="watch"} 1
node_yurthub_in_flight_requests_collector{client="yurthub",resource="services",subresources="",verb="list"} 0
node_yurthub_in_flight_requests_collector{client="yurthub",resource="services",subresources="",verb="watch"} 1

Have you used yurtadm join to add the node to this cluster, or did you use the kubeadm join command?
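For comparison, joining an edge node with yurtadm usually looks roughly like the following (server address and token are placeholders; exact flags depend on the OpenYurt version):

# yurtadm wires the kubelet to yurthub as part of the join
yurtadm join 192.168.0.1:6443 --token=<bootstrap-token> --node-type=edge --discovery-token-unsafe-skip-ca-verification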

rambohe-ch commented 5 months ago

During the execution of yss-upgrade, the logs for yurt-hub are recorded in the file named "yurt-hub-during-yss.log". After 2 minutes, if yss-upgrade fails, yurt-hub will restart and its logs will be written to the file named "yurt-hub.log".

@Ssspade It seems that the yurt-manager component is not installed correctly, because the NodePool CRD is not found in your cluster. In your log file, there are some errors about the NodePool resource, as follows:

I0326 02:14:43.771606 1 serializer.go:200] schema.GroupVersionResource{Group:"apps.openyurt.io", Version:"v1beta1", Resource:"nodepools"} is not found in client-go runtime scheme

I used this command to check it:

root@master:~# kubectl get crd
NAME                                 CREATED AT
gateways.raven.openyurt.io           2024-03-21T09:29:27Z
nodepools.apps.openyurt.io           2024-03-21T09:29:26Z
platformadmins.iot.openyurt.io       2024-03-21T09:29:26Z
yurtappdaemons.apps.openyurt.io      2024-03-21T09:29:26Z
yurtappoverriders.apps.openyurt.io   2024-03-21T09:29:26Z
yurtappsets.apps.openyurt.io         2024-03-21T09:29:26Z
yurtstaticsets.apps.openyurt.io      2024-03-21T09:29:26Z

@Ssspade Yes, from the yurthub metrics it seems that you installed yurthub before installing yurt-manager. Would you like to tell me how you added your node to the cluster?

Ssspade commented 5 months ago

Of course, I'm very willing to. For the installation process, I referred to the official website (v1.4). curl http://127.0.0.1:10267/metrics log: curl_metric.log

@Ssspade From the yurthub metrics, only yurthub itself sends requests to kube-apiserver through yurthub; other components have not sent requests to kube-apiserver through yurthub.

node_yurthub_in_flight_requests_collector{client="yurthub",resource="configmaps",subresources="",verb="list"} 0
node_yurthub_in_flight_requests_collector{client="yurthub",resource="configmaps",subresources="",verb="watch"} 1
node_yurthub_in_flight_requests_collector{client="yurthub",resource="nodepools",subresources="",verb="list"} 0
node_yurthub_in_flight_requests_collector{client="yurthub",resource="nodepools",subresources="",verb="watch"} 1
node_yurthub_in_flight_requests_collector{client="yurthub",resource="services",subresources="",verb="list"} 0
node_yurthub_in_flight_requests_collector{client="yurthub",resource="services",subresources="",verb="watch"} 1

Have you used yurtadm join to add the node to this cluster, or did you use the kubeadm join command?

I used kubeadm join to add the node. The way I installed OpenYurt is on an existing Kubernetes cluster. I haven't completed the "Join Nodes - Configure Kubelet (2.3)" step yet.
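For what it's worth, that "Configure Kubelet" step is what makes the kubelet send its traffic through YurtHub instead of directly to kube-apiserver. A rough sketch of the idea (the exact file path and port should be taken from the OpenYurt v1.4 docs; 10261 is assumed here to be YurtHub's local proxy port):

# point the kubelet at the local yurthub proxy instead of the remote kube-apiserver
cat << EOF > /var/lib/openyurt/kubelet.conf
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: http://127.0.0.1:10261
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
EOF
# then restart the kubelet with --kubeconfig pointing at this file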

Ssspade commented 4 months ago

curl http://127.0.0.1:10267/metrics

@rambohe-ch My OpenYurt was installed on top of an existing Kubernetes cluster and node. First, I followed the installation guide to install the master node, and then, according to the node joining guide (2. Install OpenYurt node components), I installed the OpenYurt node components on the existing Kubernetes nodes.

If I disregard the errors from the yss-upgrade component and proceed with the "Join Nodes" step, the curl http://127.0.0.1:10267/metrics log is curl-metric.log.

Additionally, my physical environment consists of an x86 master node and ARM64 edge nodes. I'm unsure if this might have any impact.
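One quick way to rule out an image/architecture mismatch is to compare the node architectures with the platforms of the images in use, e.g.:

# show the architecture label of every node
kubectl get nodes -L kubernetes.io/arch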

Ssspade commented 4 months ago

I made the following modifications, and the static pod upgrade function is now running normally:
(1) Downgraded the Kubernetes version: v1.23.0 to v1.22.11
(2) Downgraded the Flannel version: latest to v0.18.1
Thank you for your assistance.