Closed: huaqing1994 closed this issue 1 year ago.
This issue is currently awaiting triage.
If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Hi @jelmersnoeck, do you have any insight into the root cause behind PR #7841 (":seedling: Add configurable etcd call timeout" by jelmersnoeck, kubernetes-sigs/cluster-api)?
@jessehu we haven't root caused this yet, but saw the exact same behavior.
For more insight, and to be able to take a further look into this issue, it would be super helpful to have more information.
Some first places we could take a look at:
- the output of `etcdctl member list`
- the logs of the etcd pods (kube-system namespace) of the provisioned cluster

thx @chrischdi!
Here's more information about the issue:
```yaml apiVersion: controlplane.cluster.x-k8s.io/v1beta1 kind: KubeadmControlPlane metadata: creationTimestamp: "2023-02-01T22:11:00Z" finalizers: - kubeadm.controlplane.cluster.x-k8s.io generation: 2 labels: cluster.x-k8s.io/cluster-name: sks-e2e-test-owk3ls-k8s-upgrade managedFields: - apiVersion: controlplane.cluster.x-k8s.io/v1beta1 fieldsType: FieldsV1 fieldsV1: f:metadata: f:finalizers: .: {} v:"kubeadm.controlplane.cluster.x-k8s.io": {} f:labels: .: {} f:cluster.x-k8s.io/cluster-name: {} f:ownerReferences: .: {} k:{"uid":"697e015e-acfc-41ef-b29d-b87397eb0928"}: {} manager: manager operation: Update time: "2023-02-01T22:11:29Z" - apiVersion: controlplane.cluster.x-k8s.io/v1beta1 fieldsType: FieldsV1 fieldsV1: f:spec: .: {} f:kubeadmConfigSpec: .: {} f:clusterConfiguration: .: {} f:apiServer: .: {} f:extraArgs: .: {} f:admission-control-config-file: {} f:audit-log-maxage: {} f:audit-log-maxbackup: {} f:audit-log-maxsize: {} f:audit-log-path: {} f:audit-policy-file: {} f:enable-admission-plugins: {} f:profiling: {} f:request-timeout: {} f:tls-cipher-suites: {} f:extraVolumes: {} f:clusterName: {} f:controllerManager: .: {} f:extraArgs: .: {} f:allocate-node-cidrs: {} f:bind-address: {} f:profiling: {} f:terminated-pod-gc-threshold: {} f:dns: .: {} f:imageRepository: {} f:etcd: {} f:imageRepository: {} f:networking: {} f:scheduler: .: {} f:extraArgs: .: {} f:bind-address: {} f:profiling: {} f:files: {} f:initConfiguration: .: {} f:localAPIEndpoint: {} f:nodeRegistration: .: {} f:kubeletExtraArgs: .: {} f:event-qps: {} f:pod-infra-container-image: {} f:protect-kernel-defaults: {} f:tls-cipher-suites: {} f:name: {} f:joinConfiguration: .: {} f:discovery: {} f:nodeRegistration: .: {} f:kubeletExtraArgs: .: {} f:event-qps: {} f:pod-infra-container-image: {} f:protect-kernel-defaults: {} f:tls-cipher-suites: {} f:name: {} f:preKubeadmCommands: {} f:useExperimentalRetryJoin: {} f:users: {} f:machineTemplate: .: {} f:infrastructureRef: {} f:nodeDrainTimeout: {} f:replicas: {} f:rolloutStrategy: .: {} f:rollingUpdate: .: {} f:maxSurge: {} f:type: {} f:version: {} manager: sks-manager operation: Update time: "2023-02-01T22:22:40Z" - apiVersion: controlplane.cluster.x-k8s.io/v1beta1 fieldsType: FieldsV1 fieldsV1: f:status: .: {} f:conditions: {} f:initialized: {} f:observedGeneration: {} f:ready: {} f:readyReplicas: {} f:replicas: {} f:selector: {} f:unavailableReplicas: {} f:updatedReplicas: {} f:version: {} manager: manager operation: Update subresource: status time: "2023-02-01T22:25:33Z" name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz ownerReferences: - apiVersion: cluster.x-k8s.io/v1beta1 blockOwnerDeletion: true controller: true kind: Cluster name: sks-e2e-test-owk3ls-k8s-upgrade uid: 697e015e-acfc-41ef-b29d-b87397eb0928 resourceVersion: "6649" uid: 07081de2-ab3b-4aa0-b2d9-99f248d74810 spec: kubeadmConfigSpec: clusterConfiguration: apiServer: extraArgs: admission-control-config-file: /etc/kubernetes/admission.yaml audit-log-maxage: "30" audit-log-maxbackup: "10" audit-log-maxsize: "100" audit-log-path: /var/log/apiserver/audit.log audit-policy-file: /etc/kubernetes/auditpolicy.yaml enable-admission-plugins: AlwaysPullImages,EventRateLimit profiling: "false" request-timeout: 300s tls-cipher-suites: 
TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_3DES_EDE_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384 extraVolumes: - hostPath: /var/log/apiserver mountPath: /var/log/apiserver name: apiserver-log pathType: DirectoryOrCreate - hostPath: /etc/kubernetes/admission.yaml mountPath: /etc/kubernetes/admission.yaml name: admission-config pathType: FileOrCreate readOnly: true - hostPath: /etc/kubernetes/auditpolicy.yaml mountPath: /etc/kubernetes/auditpolicy.yaml name: audit-policy pathType: FileOrCreate readOnly: true clusterName: sks-e2e-test-owk3ls-k8s-upgrade controllerManager: extraArgs: allocate-node-cidrs: "false" bind-address: 0.0.0.0 profiling: "false" terminated-pod-gc-threshold: "10" dns: imageRepository: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/coredns etcd: {} imageRepository: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191 networking: {} scheduler: extraArgs: bind-address: 0.0.0.0 profiling: "false" files: - content: | apiVersion: v1 kind: Pod metadata: creationTimestamp: null name: kube-vip namespace: kube-system spec: containers: - args: - manager env: - name: cp_enable value: "true" - name: vip_interface value: ens4 - name: address value: 10.255.24.5 - name: port value: "6443" - name: vip_arp value: "true" - name: vip_leaderelection value: "true" - name: vip_leaseduration value: "15" - name: vip_renewdeadline value: "10" - name: vip_retryperiod value: "2" image: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/kube-vip/kube-vip:v0.5.7 imagePullPolicy: IfNotPresent name: kube-vip resources: {} securityContext: capabilities: add: - NET_ADMIN - NET_RAW volumeMounts: - mountPath: /etc/kubernetes/admin.conf name: kubeconfig hostAliases: - hostnames: - kubernetes ip: 127.0.0.1 hostNetwork: true volumes: - hostPath: path: /etc/kubernetes/admin.conf type: FileOrCreate name: kubeconfig status: {} owner: root:root path: /etc/kubernetes/manifests/kube-vip.yaml - content: | apiVersion: apiserver.config.k8s.io/v1 kind: AdmissionConfiguration plugins: - name: EventRateLimit configuration: apiVersion: eventratelimit.admission.k8s.io/v1alpha1 kind: Configuration limits: - type: Server burst: 20000 qps: 5000 owner: root:root path: /etc/kubernetes/admission.yaml - content: | apiVersion: audit.k8s.io/v1 kind: Policy rules: - level: None userGroups: - system:nodes - level: None users: - system:kube-scheduler - system:volume-scheduler - system:kube-controller-manager - level: None nonResourceURLs: - /healthz* - /version - /swagger* - level: Metadata resources: - resources: ["secrets", "configmaps", "tokenreviews"] - level: Metadata omitStages: - RequestReceived resources: - resources: ["pods", "deployments"] owner: root:root path: /etc/kubernetes/auditpolicy.yaml format: cloud-config initConfiguration: localAPIEndpoint: {} nodeRegistration: kubeletExtraArgs: event-qps: "0" pod-infra-container-image: 
registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/pause:3.6 protect-kernel-defaults: "true" tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 name: '{{ ds.meta_data.hostname }}' joinConfiguration: discovery: {} nodeRegistration: kubeletExtraArgs: event-qps: "0" pod-infra-container-image: registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/pause:3.6 protect-kernel-defaults: "true" tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 name: '{{ ds.meta_data.hostname }}' preKubeadmCommands: - if [ -d /etc/sysconfig/network-scripts/ ]; then - echo DEFROUTE=no | tee -a /etc/sysconfig/network-scripts/ifcfg-* > /dev/null - echo PEERDNS=no | tee -a /etc/sysconfig/network-scripts/ifcfg-* > /dev/null - sed -i '/DEFROUTE\|PEERDNS/d' /etc/sysconfig/network-scripts/ifcfg-ens4 - nmcli connection reload && nmcli networking off && nmcli networking on - fi - '[ -f "/run/kubeadm/kubeadm.yaml" ] && printf "\n---\napiVersion: kubeproxy.config.k8s.io/v1alpha1\nkind: KubeProxyConfiguration\nmetricsBindAddress: 0.0.0.0:10249\n" >> /run/kubeadm/kubeadm.yaml' - hostname "{{ ds.meta_data.hostname }}" - echo "::1 ipv6-localhost ipv6-loopback" >/etc/hosts - echo "127.0.0.1 localhost" >>/etc/hosts - echo "127.0.0.1 {{ ds.meta_data.hostname }}" >>/etc/hosts - echo "{{ ds.meta_data.hostname }}" >/etc/hostname - sed -i 's|".*/pause.*"|"registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/pause:3.6"|' /etc/containerd/config.toml - systemctl restart containerd - while ! 
systemctl status containerd; do sleep 1; done useExperimentalRetryJoin: true users: - lockPassword: false name: smartx passwd: $6$fbz1r5cI$ISJj4xg/DSGLlsIHYWFyuhrSkAtX76AH7RTRTs7h/WFbhVazjq51zNQGsuWYu1JsF1tfYCLCbo9xD3Nr8UY5w0 sudo: ALL=(ALL) NOPASSWD:ALL machineTemplate: infrastructureRef: apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 kind: ElfMachineTemplate name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-v1-23-14-c7w63y namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz metadata: {} nodeDrainTimeout: 5m0s replicas: 3 rolloutStrategy: rollingUpdate: maxSurge: 1 type: RollingUpdate version: v1.23.14 status: conditions: - lastTransitionTime: "2023-02-01T22:22:46Z" message: Rolling 3 replicas with outdated spec (1 replicas up to date) reason: RollingUpdateInProgress severity: Warning status: "False" type: Ready - lastTransitionTime: "2023-02-01T22:12:59Z" status: "True" type: Available - lastTransitionTime: "2023-02-01T22:11:31Z" status: "True" type: CertificatesAvailable - lastTransitionTime: "2023-02-01T22:25:33Z" status: "True" type: ControlPlaneComponentsHealthy - lastTransitionTime: "2023-02-01T22:25:33Z" status: "True" type: EtcdClusterHealthy - lastTransitionTime: "2023-02-01T22:17:30Z" status: "True" type: MachinesCreated - lastTransitionTime: "2023-02-01T22:25:09Z" status: "True" type: MachinesReady - lastTransitionTime: "2023-02-01T22:22:46Z" message: Rolling 3 replicas with outdated spec (1 replicas up to date) reason: RollingUpdateInProgress severity: Warning status: "False" type: MachinesSpecUpToDate - lastTransitionTime: "2023-02-01T22:22:43Z" message: Scaling down control plane to 3 replicas (actual 4) reason: ScalingDown severity: Warning status: "False" type: Resized initialized: true observedGeneration: 2 ready: true readyReplicas: 3 replicas: 4 selector: cluster.x-k8s.io/cluster-name=sks-e2e-test-owk3ls-k8s-upgrade,cluster.x-k8s.io/control-plane unavailableReplicas: 1 updatedReplicas: 1 version: v1.22.16 ```
low version Machine1: ```yaml apiVersion: cluster.x-k8s.io/v1beta1 kind: Machine metadata: annotations: controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{"extraArgs":{"admission-control-config-file":"/etc/kubernetes/admission.yaml","audit-log-maxage":"30","audit-log-maxbackup":"10","audit-log-maxsize":"100","audit-log-path":"/var/log/apiserver/audit.log","audit-policy-file":"/etc/kubernetes/auditpolicy.yaml","enable-admission-plugins":"AlwaysPullImages,EventRateLimit","profiling":"false","request-timeout":"300s","tls-cipher-suites":"TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_3DES_EDE_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384"},"extraVolumes":[{"name":"apiserver-log","hostPath":"/var/log/apiserver","mountPath":"/var/log/apiserver","pathType":"DirectoryOrCreate"},{"name":"admission-config","hostPath":"/etc/kubernetes/admission.yaml","mountPath":"/etc/kubernetes/admission.yaml","readOnly":true,"pathType":"FileOrCreate"},{"name":"audit-policy","hostPath":"/etc/kubernetes/auditpolicy.yaml","mountPath":"/etc/kubernetes/auditpolicy.yaml","readOnly":true,"pathType":"FileOrCreate"}]},"controllerManager":{"extraArgs":{"allocate-node-cidrs":"false","bind-address":"0.0.0.0","profiling":"false","terminated-pod-gc-threshold":"10"}},"scheduler":{"extraArgs":{"bind-address":"0.0.0.0","profiling":"false"}},"dns":{"imageRepository":"registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191/coredns"},"imageRepository":"registry.smtx.io/kubesmart-e2e-test/huhui.test-template-v1.22/build191","clusterName":"sks-e2e-test-owk3ls-k8s-upgrade"}' creationTimestamp: "2023-02-01T22:11:32Z" finalizers: - machine.cluster.x-k8s.io generation: 3 labels: cluster.x-k8s.io/cluster-name: sks-e2e-test-owk3ls-k8s-upgrade cluster.x-k8s.io/control-plane: "" cluster.x-k8s.io/control-plane-name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane managedFields: - apiVersion: cluster.x-k8s.io/v1beta1 fieldsType: FieldsV1 fieldsV1: f:metadata: f:annotations: .: {} f:controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: {} f:finalizers: .: {} v:"machine.cluster.x-k8s.io": {} f:labels: .: {} f:cluster.x-k8s.io/cluster-name: {} f:cluster.x-k8s.io/control-plane: {} f:cluster.x-k8s.io/control-plane-name: {} f:ownerReferences: .: {} k:{"uid":"07081de2-ab3b-4aa0-b2d9-99f248d74810"}: {} f:spec: .: {} f:bootstrap: .: {} f:configRef: {} f:dataSecretName: {} f:clusterName: {} f:infrastructureRef: {} f:nodeDrainTimeout: {} f:providerID: {} f:version: {} manager: manager operation: Update time: "2023-02-01T22:13:56Z" - apiVersion: cluster.x-k8s.io/v1beta1 fieldsType: FieldsV1 fieldsV1: f:status: .: {} f:addresses: {} f:bootstrapReady: {} f:conditions: {} f:infrastructureReady: {} f:lastUpdated: {} f:nodeInfo: .: {} f:architecture: {} f:bootID: {} f:containerRuntimeVersion: {} f:kernelVersion: {} f:kubeProxyVersion: {} f:kubeletVersion: {} f:machineID: {} 
f:operatingSystem: {} f:osImage: {} f:systemUUID: {} f:nodeRef: {} f:observedGeneration: {} f:phase: {} manager: manager operation: Update subresource: status time: "2023-02-01T22:29:05Z" name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-9cg5m namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz ownerReferences: - apiVersion: controlplane.cluster.x-k8s.io/v1beta1 blockOwnerDeletion: true controller: true kind: KubeadmControlPlane name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane uid: 07081de2-ab3b-4aa0-b2d9-99f248d74810 resourceVersion: "7816" uid: b7c17f00-18cd-42e6-910b-cae9a4983f5d spec: bootstrap: configRef: apiVersion: bootstrap.cluster.x-k8s.io/v1beta1 kind: KubeadmConfig name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-hx9ps namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz uid: 973f9b62-6e3b-45fb-9fab-93c18bbdd4a6 dataSecretName: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-hx9ps clusterName: sks-e2e-test-owk3ls-k8s-upgrade infrastructureRef: apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 kind: ElfMachine name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-qtrrf namespace: sks-e2e-huhui-test-template-v1-22-191-vflqvz uid: 054600bd-e693-4017-ae17-620280cf3abb nodeDeletionTimeout: 10s nodeDrainTimeout: 5m0s providerID: elf://6b4dfb57-e3fb-4d67-bf5c-c5d87d92eb50 version: v1.22.16 status: addresses: - address: 10.255.1.149 type: InternalIP - address: 240.255.0.1 type: InternalIP - address: 10.244.7.146 type: InternalIP - address: 100.64.254.254 type: InternalIP - address: 10.255.35.0 type: InternalIP - address: 100.64.254.254 type: InternalIP - address: 10.255.35.1 type: InternalIP - address: 100.64.254.254 type: InternalIP - address: 10.255.35.9 type: InternalIP - address: 100.64.254.254 type: InternalIP - address: 10.255.35.34 type: InternalIP bootstrapReady: true conditions: - lastTransitionTime: "2023-02-01T22:13:56Z" status: "True" type: Ready - lastTransitionTime: "2023-02-01T22:17:13Z" status: "True" type: APIServerPodHealthy - lastTransitionTime: "2023-02-01T22:11:32Z" status: "True" type: BootstrapReady - lastTransitionTime: "2023-02-01T22:13:57Z" status: "True" type: ControllerManagerPodHealthy - lastTransitionTime: "2023-02-01T22:13:57Z" status: "True" type: EtcdMemberHealthy - lastTransitionTime: "2023-02-01T22:13:57Z" status: "True" type: EtcdPodHealthy - lastTransitionTime: "2023-02-01T22:13:56Z" status: "True" type: InfrastructureReady - lastTransitionTime: "2023-02-01T22:13:56Z" status: "True" type: NodeHealthy - lastTransitionTime: "2023-02-01T22:13:57Z" status: "True" type: SchedulerPodHealthy infrastructureReady: true lastUpdated: "2023-02-01T22:13:56Z" nodeInfo: architecture: amd64 bootID: db973ed3-7a84-480c-a62f-4f0f4d3893e6 containerRuntimeVersion: containerd://1.6.4 kernelVersion: 4.18.0-372.9.1.el8.x86_64 kubeProxyVersion: v1.22.16 kubeletVersion: v1.22.16 machineID: 89e1bbf094cb41519341c3040fbef924 operatingSystem: linux osImage: Rocky Linux 8.6 (Green Obsidian) systemUUID: 89e1bbf0-94cb-4151-9341-c3040fbef924 nodeRef: apiVersion: v1 kind: Node name: sks-e2e-test-owk3ls-k8s-upgrade-controlplane-qtrrf uid: c9e33a7c-e9e3-4fee-9569-70906836fe6c observedGeneration: 3 phase: Running ``` low version Machine2: ```yaml ``` low version Machine3: ```yaml ``` high version Machine: ```yaml ```
```
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.255.1.149:2379  | 224b96420b3e1b24 | 3.5.0   | 35 MB   | true      | false      | 3         | 260058     | 260058             |        |
| https://10.255.2.106:2379  | 7dec8de804d9d3c1 | 3.5.5   | 34 MB   | false     | false      | 3         | 260058     | 260058             |        |
| https://10.255.1.177:2379  | 9aee0fa65b787fc9 | 3.5.0   | 34 MB   | false     | false      | 3         | 260058     | 260058             |        |
| https://10.255.2.18:2379   | af37f0689bf2cfc4 | 3.5.0   | 34 MB   | false     | false      | 3         | 260058     | 260058             |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
```
{"level":"info","ts":"2023-02-01T22:25:47.374Z","caller":"traceutil/trace.go:171","msg":"trace[408559093] transaction","detail":"{read_only:false; response_revision:7618; number_of_response:1; }","duration":"119.081979ms","start":"2023-02-01T22:25:47.255Z","end":"2023-02-01T22:25:47.374Z","steps":["trace[408559093] 'process raft request' (duration: 84.44308ms)","trace[408559093] 'compare' (duration: 34.029498ms)"],"step_count":2}
{"level":"info","ts":"2023-02-01T22:25:48.876Z","caller":"etcdserver/server.go:1398","msg":"leadership transfer starting","local-member-id":"224b96420b3e1b24","current-leader-member-id":"224b96420b3e1b24","transferee-member-id":"7dec8de804d9d3c1"}
{"level":"info","ts":"2023-02-01T22:25:48.876Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"224b96420b3e1b24 [term 3] starts to transfer leadership to 7dec8de804d9d3c1"}
{"level":"warn","ts":"2023-02-01T22:25:49.141Z","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"53c1860f15b1881b","error":"raft proposal dropped"}
{"level":"warn","ts":"2023-02-01T22:25:49.141Z","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"7fc9860f0e0e45da","error":"raft proposal dropped"}
{"level":"warn","ts":"2023-02-01T22:25:50.278Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"warn","ts":"2023-02-01T22:25:54.281Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"info","ts":"2023-02-01T22:25:54.711Z","caller":"traceutil/trace.go:171","msg":"trace[2126905403] transaction","detail":"{read_only:false; number_of_response:0; response_revision:7680; }","duration":"103.675532ms","start":"2023-02-01T22:25:54.608Z","end":"2023-02-01T22:25:54.711Z","steps":["trace[2126905403] 'process raft request' (duration: 103.470103ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-01T22:25:58.286Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"warn","ts":"2023-02-01T22:25:58.531Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"126.824875ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/pods/kube-system/ecp-agent-s8mdq\" ","response":"range_response_count:1 size:6963"}
{"level":"info","ts":"2023-02-01T22:25:58.531Z","caller":"traceutil/trace.go:171","msg":"trace[689488468] range","detail":"{range_begin:/registry/pods/kube-system/ecp-agent-s8mdq; range_end:; response_count:1; response_revision:7718; }","duration":"126.952099ms","start":"2023-02-01T22:25:58.404Z","end":"2023-02-01T22:25:58.531Z","steps":["trace[689488468] 'agreement among raft nodes before linearized reading' (duration: 126.759114ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-01T22:26:02.290Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"warn","ts":"2023-02-01T22:26:06.294Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"info","ts":"2023-02-01T22:26:06.809Z","caller":"traceutil/trace.go:171","msg":"trace[1814538697] transaction","detail":"{read_only:false; response_revision:7790; number_of_response:1; }","duration":"147.964398ms","start":"2023-02-01T22:26:06.661Z","end":"2023-02-01T22:26:06.809Z","steps":["trace[1814538697] 'process raft request' (duration: 53.737143ms)","trace[1814538697] 'compare' (duration: 94.068255ms)"],"step_count":2}
{"level":"info","ts":"2023-02-01T22:26:08.635Z","caller":"traceutil/trace.go:171","msg":"trace[1507755063] transaction","detail":"{read_only:false; response_revision:7796; number_of_response:1; }","duration":"153.940219ms","start":"2023-02-01T22:26:08.481Z","end":"2023-02-01T22:26:08.635Z","steps":["trace[1507755063] 'process raft request' (duration: 153.7693ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-01T22:26:10.298Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"warn","ts":"2023-02-01T22:26:14.306Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"warn","ts":"2023-02-01T22:26:18.312Z","caller":"etcdserver/cluster_util.go:186","msg":"leader found higher-versioned member","local-member-version":"3.5.0","remote-member-id":"7dec8de804d9d3c1","remote-member-version":"3.5.5"}
{"level":"info","ts":"2023-02-01T22:26:20.410Z","caller":"traceutil/trace.go:171","msg":"trace[1090351540] linearizableReadLoop","detail":"{readStateIndex:9043; appliedIndex:9043; }","duration":"100.675717ms","start":"2023-02-01T22:26:20.309Z","end":"2023-02-01T22:26:20.410Z","steps":["trace[1090351540] 'read index received' (duration: 100.670472ms)","trace[1090351540] 'applied index is now lower than readState.Index' (duration: 3.33µs)"],"step_count":2}
{"level":"info","ts":"2023-02-01T22:26:20.410Z","caller":"traceutil/trace.go:171","msg":"trace[1938834446] transaction","detail":"{read_only:false; response_revision:7883; number_of_response:1; }","duration":"104.937153ms","start":"2023-02-01T22:26:20.305Z","end":"2023-02-01T22:26:20.410Z","steps":["trace[1938834446] 'process raft request' (duration: 104.72741ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-01T22:26:20.410Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"101.021426ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"range_response_count:0 size:5"}
{"level":"info","ts":"2023-02-01T22:26:20.411Z","caller":"traceutil/trace.go:171","msg":"trace[627460088] range","detail":"{range_begin:/registry/health; range_end:; response_count:0; response_revision:7883; }","duration":"101.361459ms","start":"2023-02-01T22:26:20.309Z","end":"2023-02-01T22:26:20.411Z","steps":["trace[627460088] 'agreement among raft nodes before linearized reading' (duration: 100.920503ms)"],"step_count":1}
{"level":"info","ts":"2023-02-01T22:26:20.651Z","caller":"traceutil/trace.go:171","msg":"trace[1878359730] transaction","detail":"{read_only:false; response_revision:7885; number_of_response:1; }","duration":"118.677494ms","start":"2023-02-01T22:26:20.532Z","end":"2023-02-01T22:26:20.651Z","steps":["trace[1878359730] 'process raft request' (duration: 118.340079ms)"],"step_count":1}
{"level":"info","ts":"2023-02-01T22:26:20.910Z","caller":"traceutil/trace.go:171","msg":"trace[727615520] transaction","detail":"{read_only:false; response_revision:7887; number_of_response:1; }","duration":"246.831763ms","start":"2023-02-01T22:26:20.663Z","end":"2023-02-01T22:26:20.910Z","steps":["trace[727615520] 'process raft request' (duration: 246.608132ms)"],"step_count":1}
{"level":"info","ts":"2023-02-01T22:26:21.010Z","caller":"traceutil/trace.go:171","msg":"trace[1365621629] transaction","detail":"{read_only:false; response_revision:7888; number_of_response:1; }","duration":"224.62174ms","start":"2023-02-01T22:26:20.786Z","end":"2023-02-01T22:26:21.010Z","steps":["trace[1365621629] 'process raft request' (duration: 224.409399ms)"],"step_count":1}
{"level":"info","ts":"2023-02-01T22:26:21.264Z","caller":"traceutil/trace.go:171","msg":"trace[927596763] linearizableReadLoop","detail":"{readStateIndex:9050; appliedIndex:9050; }","duration":"112.974198ms","start":"2023-02-01T22:26:21.151Z","end":"2023-02-01T22:26:21.264Z","steps":["trace[927596763] 'read index received' (duration: 112.966154ms)","trace[927596763] 'applied index is now lower than readState.Index' (duration: 6.806µs)"],"step_count":2}
{"level":"warn","ts":"2023-02-01T22:26:21.264Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"113.210732ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/configmaps/sks-system-logging/92b05146.kubesmart.smtx.io\" ","response":"range_response_count:1 size:626"}
{"level":"info","ts":"2023-02-01T22:26:21.264Z","caller":"traceutil/trace.go:171","msg":"trace[321391868] range","detail":"{range_begin:/registry/configmaps/sks-system-logging/92b05146.kubesmart.smtx.io; range_end:; response_count:1; response_revision:7889; }","duration":"113.319894ms","start":"2023-02-01T22:26:21.151Z","end":"2023-02-01T22:26:21.264Z","steps":["trace[321391868] 'agreement among raft nodes before linearized reading' (duration: 113.135439ms)"],"step_count":1}
{"level":"info","ts":"2023-02-01T22:26:21.264Z","caller":"traceutil/trace.go:171","msg":"trace[1170037460] transaction","detail":"{read_only:false; response_revision:7890; number_of_response:1; }","duration":"136.68946ms","start":"2023-02-01T22:26:21.127Z","end":"2023-02-01T22:26:21.264Z","steps":["trace[1170037460] 'process raft request' (duration: 136.488833ms)"],"step_count":1}
{"level":"info","ts":"2023-02-01T22:26:21.420Z","caller":"traceutil/trace.go:171","msg":"trace[903375563] transaction","detail":"{read_only:false; response_revision:7891; number_of_response:1; }","duration":"149.366209ms","start":"2023-02-01T22:26:21.271Z","end":"2023-02-01T22:26:21.420Z","steps":["trace[903375563] 'process raft request' (duration: 108.144398ms)","trace[903375563] 'compare' (duration: 41.082086ms)"],"step_count":2}
{"level":"warn","ts":"2023-02-01T22:26:21.614Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"177.908165ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/kube-system/snapshot-controller-leader\" ","response":"range_response_count:1 size:500"}
{"level":"info","ts":"2023-02-01T22:26:21.615Z","caller":"traceutil/trace.go:171","msg":"trace[414573502] range","detail":"{range_begin:/registry/leases/kube-system/snapshot-controller-leader; range_end:; response_count:1; response_revision:7891; }","duration":"178.036965ms","start":"2023-02-01T22:26:21.436Z","end":"2023-02-01T22:26:21.615Z","steps":["trace[414573502] 'agreement among raft nodes before linearized reading' (duration: 95.287081ms)","trace[414573502] 'range keys from in-memory index tree' (duration: 82.539481ms)"],"step_count":2}
{"level":"warn","ts":"2023-02-01T22:26:21.615Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"120.832593ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/sks-system/external-resizer-com-smartx-zbs-csi-driver\" ","response":"range_response_count:1 size:531"}
{"level":"info","ts":"2023-02-01T22:26:21.615Z","caller":"traceutil/trace.go:171","msg":"trace[534010467] range","detail":"{range_begin:/registry/leases/sks-system/external-resizer-com-smartx-zbs-csi-driver; range_end:; response_count:1; response_revision:7893; }","duration":"120.897173ms","start":"2023-02-01T22:26:21.494Z","end":"2023-02-01T22:26:21.615Z","steps":["trace[534010467] 'agreement among raft nodes before linearized reading' (duration: 120.783144ms)"],"step_count":1}
{"level":"info","ts":"2023-02-01T22:26:21.615Z","caller":"traceutil/trace.go:171","msg":"trace[947492628] transaction","detail":"{read_only:false; response_revision:7892; number_of_response:1; }","duration":"342.031603ms","start":"2023-02-01T22:26:21.273Z","end":"2023-02-01T22:26:21.615Z","steps":["trace[947492628] 'process raft request' (duration: 259.023409ms)","trace[947492628] 'compare' (duration: 82.41945ms)"],"step_count":2}
{"level":"info","ts":"2023-02-01T22:26:21.615Z","caller":"traceutil/trace.go:171","msg":"trace[565397956] transaction","detail":"{read_only:false; response_revision:7893; number_of_response:1; }","duration":"341.157023ms","start":"2023-02-01T22:26:21.274Z","end":"2023-02-01T22:26:21.615Z","steps":["trace[565397956] 'process raft request' (duration: 340.737783ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-01T22:26:21.615Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-02-01T22:26:21.274Z","time spent":"341.225415ms","remote":"127.0.0.1:40136","response type":"/etcdserverpb.KV/Txn","request count":1,"request size":587,"response count":0,"response size":39,"request content":"compare:
I see that the etcd leader received the request to transfer leadership, but I don't see any logs of the transfer actually being executed; in fact, leadership was never transferred. If the transfer failed this time, why does the kcp-controller's call to MoveLeader() get stuck instead of returning a failure? That's what I'm wondering about.
For finding the real root cause, I think it would be necessary to go down to the etcd level and try to debug it on that side (e.g. see whether the leader can be moved manually, or check whether the disk is fast enough to run etcd; either could be a cause). A hedged example of a manual transfer is sketched below.
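For illustration, a manual leadership transfer against the leader endpoint from the etcdctl table above could look like this; the certificate paths are assumptions based on a default kubeadm layout and may differ in your environment:

```
# Endpoint is the current leader; 7dec8de804d9d3c1 is the transferee member ID (hex).
# Cert paths are assumed kubeadm defaults.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.255.1.149:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  move-leader 7dec8de804d9d3c1
```

If the manual transfer also hangs or times out, that points at etcd itself (e.g. raft proposals being dropped) rather than at the CAPI client code.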
From CAPI perspective: I think #7841 could already improve on this particular issue. However, it would be interesting if it solves this issue by retrying the leader move (which would happen via #7841 if I understood it correctly, at least the KCP controller would surface a timeout error and retry).
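To illustrate the pattern that #7841 introduces, here is a minimal sketch assuming the controller talks to etcd through go.etcd.io/etcd/client/v3; this is not CAPI's actual code, and the timeout value is illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// moveLeaderWithTimeout bounds MoveLeader with a deadline so a dropped raft
// proposal surfaces as a retryable error instead of blocking the reconcile
// goroutine indefinitely.
func moveLeaderWithTimeout(cli *clientv3.Client, transfereeID uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// clientv3.Client embeds the Maintenance interface, which provides MoveLeader.
	if _, err := cli.MoveLeader(ctx, transfereeID); err != nil {
		return fmt.Errorf("failed to move etcd leader to member %x: %w", transfereeID, err)
	}
	return nil
}
```

On timeout the reconcile returns an error and controller-runtime requeues, giving the transfer another chance on the next loop.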
As an FYI: going back through some of these occurrences, we've seen etcd client calls block in multiple spots. The initial issue we ran into was with the etcd member-list call; the same guard applies there (see the sketch below).
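Again a hedged sketch with go.etcd.io/etcd/client/v3 rather than CAPI's actual wrapper, applying the same deadline pattern to the member-list call:

```go
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// memberListWithTimeout applies the same deadline pattern to MemberList, so a
// hung endpoint cannot block the controller's etcd health checks.
func memberListWithTimeout(cli *clientv3.Client) (*clientv3.MemberListResponse, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	return cli.MemberList(ctx)
}
```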
/triage needs-information

@jelmersnoeck any update on this investigation?
@fabriziopandini we didn't find anything pointing us to a concrete issue and haven't investigated this further. With the etcd client timeouts in place, this hasn't come up for us again.
I'm not sure if @huaqing1994 has found any further leads?
thanks for the feedback
/close
We can re-open this issue, or open a new one, if new evidence is collected.
@fabriziopandini: Closing this issue.
What steps did you take and what happened: I've found an issue that occurs occasionally in our CI environment (about 3 times a month). We use CAPI to create and upgrade a workload cluster (3 control plane + 3 worker nodes). The upgrade occasionally gets stuck after the first new control plane node is created successfully. After the problem occurs, kcp-controller no longer prints any logs about this cluster. The goroutine stack trace of kcp-controller shows that a reconcile has been blocked in etcd MoveLeader() (a sketch of capturing such a dump is below).
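For reference, a goroutine dump like the one mentioned above can be captured from any Go binary with the standard runtime/pprof package; this is a generic sketch, not an actual kcp-controller debug hook:

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes all goroutine stacks to stderr. debug=2 prints them
// in the panic-style format, which makes a goroutine blocked inside an etcd
// client call such as MoveLeader() easy to spot.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```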
I found a related PR, #7841, which fixes kcp-controller getting stuck. But I'd like to understand the reason for and mechanism of this problem. Has anyone else encountered it?
Environment:
- Kubernetes version (use kubectl version): 1.25.0
- OS (e.g. from /etc/os-release): CentOS 8

/kind bug