kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

KCP reconcile hangs when the workload cluster API Server is unreachable #8948

Closed Levi080513 closed 1 year ago

Levi080513 commented 1 year ago

What steps did you take and what happened?

When upgrading KCP, I found that the KCP reconcile hangs. Even if the annotation is manually updated to trigger a reconcile, no reconcile log is printed.

KCP log

2023-06-29T18:33:38.306571117+08:00 stderr F E0629 10:33:38.300120       1 controller.go:329] "Reconciler error" err="[cannot get remote client to workload cluster: failed to create cluster accessor: error checking if we're running on workload cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": context deadline exceeded, failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)]" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=4f8f9888-0cbe-4f56-8fe8-2407662dbaa1
2023-06-29T18:33:38.306582944+08:00 stderr F I0629 10:33:38.300746       1 controller.go:276] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5696188f-a423-42be-aca0-22680ef26ceb Cluster="default/sks-mgmt"
2023-06-29T18:33:53.808921324+08:00 stderr F E0629 10:33:53.807069       1 controller.go:199] "Failed to update KubeadmControlPlane Status" err="failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": http2: client connection lost" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5696188f-a423-42be-aca0-22680ef26ceb Cluster="default/sks-mgmt"
2023-06-29T18:33:53.811961844+08:00 stderr F E0629 10:33:53.809752       1 controller.go:329] "Reconciler error" err="[cannot get remote client to workload cluster: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers), failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": http2: client connection lost]" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5696188f-a423-42be-aca0-22680ef26ceb
2023-06-29T18:38:32.001485567+08:00 stderr F I0629 10:38:32.001239       1 controller.go:276] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5e663c6a-00cd-4d31-954f-336a10547d52 Cluster="default/sks-mgmt"
2023-06-29T18:38:33.763802075+08:00 stderr F I0629 10:38:33.762365       1 cluster_cache_tracker.go:298] "Creating cluster accessor for cluster \"default/sks-mgmt\" with the regular apiserver endpoint \"https://10.255.125.6:6443\"" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5e663c6a-00cd-4d31-954f-336a10547d52 Cluster="default/sks-mgmt"
2023-06-29T18:39:26.54150753+08:00 stderr F W0629 10:39:26.540559       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:26.541584905+08:00 stderr F I0629 10:39:26.540763       1 trace.go:219] Trace[1245889233]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169 (29-Jun-2023 10:39:16.536) (total time: 10003ms):
2023-06-29T18:39:26.541600266+08:00 stderr F Trace[1245889233]: ---"Objects listed" error:Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers) 10003ms (10:39:26.540)
2023-06-29T18:39:26.541605636+08:00 stderr F Trace[1245889233]: [10.003856562s] [10.003856562s] END
2023-06-29T18:39:26.54161313+08:00 stderr F E0629 10:39:26.540925       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:31.421271219+08:00 stderr F W0629 10:39:31.420913       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": http2: client connection lost
2023-06-29T18:39:31.421356584+08:00 stderr F E0629 10:39:31.420985       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": http2: client connection lost
2023-06-29T18:39:31.421402287+08:00 stderr F E0629 10:39:31.420994       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?allowWatchBookmarks=true&resourceVersion=35679197&timeout=10s&timeoutSeconds=413&watch=true": http2: client connection lost - error from a previous attempt: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:42.793292892+08:00 stderr F W0629 10:39:42.792403       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:42.793331013+08:00 stderr F I0629 10:39:42.792878       1 trace.go:219] Trace[664855972]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169 (29-Jun-2023 10:39:32.791) (total time: 10001ms):
2023-06-29T18:39:42.793336977+08:00 stderr F Trace[664855972]: ---"Objects listed" error:Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 10001ms (10:39:42.792)
2023-06-29T18:39:42.793341802+08:00 stderr F Trace[664855972]: [10.001440542s] [10.001440542s] END
2023-06-29T18:39:42.793355345+08:00 stderr F E0629 10:39:42.792914       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:44.285365522+08:00 stderr F W0629 10:39:44.284282       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:44.285415251+08:00 stderr F I0629 10:39:44.284479       1 trace.go:219] Trace[333138094]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169 (29-Jun-2023 10:39:34.282) (total time: 10001ms):
2023-06-29T18:39:44.285421731+08:00 stderr F Trace[333138094]: ---"Objects listed" error:Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 10001ms (10:39:44.284)
2023-06-29T18:39:44.2854264+08:00 stderr F Trace[333138094]: [10.001639339s] [10.001639339s] END
2023-06-29T18:39:44.28543114+08:00 stderr F E0629 10:39:44.284504       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:49.43914295+08:00 stderr F W0629 10:39:49.438066       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:49.441353046+08:00 stderr F E0629 10:39:49.438156       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.58127205+08:00 stderr F W0629 10:39:55.579978       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.581324556+08:00 stderr F E0629 10:39:55.580061       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.581331354+08:00 stderr F W0629 10:39:55.580109       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.581337158+08:00 stderr F E0629 10:39:55.580196       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:04.806961247+08:00 stderr F W0629 10:40:04.802654       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:04.807016783+08:00 stderr F E0629 10:40:04.802774       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:14.017608903+08:00 stderr F W0629 10:40:14.017353       1 reflector.go:424] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:14.017674617+08:00 stderr F E0629 10:40:14.017427       1 reflector.go:140] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:28.988010911+08:00 stderr F E0629 10:40:28.987542       1 cluster_cache_tracker.go:602] "remote/ClusterCacheTracker: Error health checking cluster" err="Get \"https://10.255.125.6:6443/?timeout=5s\": context deadline exceeded" Cluster="default/sks-mgmt"

I killed the KCP process with kill -SIGQUIT and found that the KCP reconcile was stuck in Workload.EnsureResource and would not exit.

2023-06-29T19:38:59.267600649+08:00 stderr F
2023-06-29T19:38:59.267604842+08:00 stderr F goroutine 484 [select]:
2023-06-29T19:38:59.267608916+08:00 stderr F runtime.gopark(0xc000ea8bc0?, 0x2?, 0xa?, 0xa9?, 0xc000ea8b9c?)
2023-06-29T19:38:59.267613041+08:00 stderr F    runtime/proc.go:363 +0xd6 fp=0xc000ea8a20 sp=0xc000ea8a00 pc=0x43cb76
2023-06-29T19:38:59.267616605+08:00 stderr F runtime.selectgo(0xc000ea8bc0, 0xc000ea8b98, 0x0?, 0x0, 0x2?, 0x1)
2023-06-29T19:38:59.267620318+08:00 stderr F    runtime/select.go:328 +0x7bc fp=0xc000ea8b60 sp=0xc000ea8a20 pc=0x44ca1c
2023-06-29T19:38:59.2676244+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.WaitForWithContext({0x1f4b530, 0xc000c8e0c0}, 0xc000479f98, 0x108734a?)
2023-06-29T19:38:59.267628225+08:00 stderr F    k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:660 +0xe5 fp=0xc000ea8bf8 sp=0xc000ea8b60 pc=0x1087185
2023-06-29T19:38:59.267632189+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.poll({0x1f4b530, 0xc000c8e0c0}, 0x68?, 0x1086345?, 0xc0008095e0?)
2023-06-29T19:38:59.267650652+08:00 stderr F    k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:596 +0x9a fp=0xc000ea8c38 sp=0xc000ea8bf8 pc=0x108703a
2023-06-29T19:38:59.267654508+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext({0x1f4b530, 0xc000c8e0c0}, 0x20?, 0x2e19a80?)
2023-06-29T19:38:59.267658202+08:00 stderr F    k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:547 +0x49 fp=0xc000ea8c78 sp=0xc000ea8c38 pc=0x1086f49
2023-06-29T19:38:59.267661921+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil(0xc0012ad9b0?, 0x0?, 0xc000720d10?)
2023-06-29T19:38:59.267665694+08:00 stderr F    k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:538 +0x7c fp=0xc000ea8ce8 sp=0xc000ea8c78 pc=0x1086e7c
2023-06-29T19:38:59.267669273+08:00 stderr F k8s.io/client-go/tools/cache.WaitForCacheSync(0xc0007305a0?, {0xc00055a470, 0x1, 0x1})
2023-06-29T19:38:59.267673449+08:00 stderr F    k8s.io/client-go@v0.26.1/tools/cache/shared_informer.go:288 +0x88 fp=0xc000ea8d38 sp=0xc000ea8ce8 pc=0x14b5988
2023-06-29T19:38:59.268143623+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/cache/internal.(*specificInformersMap).Get(0x403e56?, {0x1f4b5d8, 0xc0011376e0}, {{0x1c82d57, 0x19}, {0x1c613fc, 0x2}, {0x1869344, 0xb}}, {0x1f378c0, ...})
2023-06-29T19:38:59.268255075+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/cache/internal/informers_map.go:208 +0x1dd fp=0xc000ea8de8 sp=0xc000ea8d38 pc=0x14c20dd
2023-06-29T19:38:59.268262915+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/cache/internal.(*InformersMap).Get(0xc000720ed8?, {0x1f4b5d8?, 0xc0011376e0?}, {{0x1c82d57, 0x19}, {0x1c613fc, 0x2}, {0x1869344, 0xb}}, {0x1f378c0, ...})
2023-06-29T19:38:59.268267921+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/cache/internal/deleg_map.go:106 +0x1fc fp=0xc000ea8e50 sp=0xc000ea8de8 pc=0x14c10fc
2023-06-29T19:38:59.268280169+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).Get(0xc00011de58, {0x1f4b5d8, 0xc0011376e0}, {{0x1c6cada?, 0x19?}, {0x1c75dbd?, 0x2?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268288586+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/cache/informer_cache.go:60 +0x1a5 fp=0xc000ea8f58 sp=0xc000ea8e50 pc=0x14c7105
2023-06-29T19:38:59.268304866+08:00 stderr F sigs.k8s.io/cluster-api/controllers/remote.(*stoppableCache).Get(0x2e19a80?, {0x1f4b5d8?, 0xc0011376e0?}, {{0x1c6cada?, 0xc000721028?}, {0x1c75dbd?, 0x10?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268310266+08:00 stderr F    <autogenerated>:1 +0x86 fp=0xc000ea8fc8 sp=0xc000ea8f58 pc=0x15a0e06
2023-06-29T19:38:59.268314821+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/client.(*delegatingReader).Get(0xc000d22ec0, {0x1f4b5d8, 0xc0011376e0}, {{0x1c6cada?, 0xc0007210c8?}, {0x1c75dbd?, 0x1bca0c0?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268319147+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/client/split.go:132 +0x1ab fp=0xc000ea9068 sp=0xc000ea8fc8 pc=0x149c62b
2023-06-29T19:38:59.26835893+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/client.(*delegatingClient).Get(0xc000b5c700?, {0x1f4b5d8?, 0xc0011376e0?}, {{0x1c6cada?, 0x40d8bf?}, {0x1c75dbd?, 0x80?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268376782+08:00 stderr F    <autogenerated>:1 +0x86 fp=0xc000ea90d8 sp=0xc000ea9068 pc=0x14a5ac6
2023-06-29T19:38:59.268407267+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal.(*Workload).EnsureResource(0xc000a12180, {0x1f4b5d8, 0xc0011376e0}, {0x1f623b0, 0xc0001697c0})
2023-06-29T19:38:59.268420905+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/workload_cluster_rbac.go:55 +0x103 fp=0xc000ea91b8 sp=0xc000ea90d8 pc=0x182c143
2023-06-29T19:38:59.268432866+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal.(*Workload).AllowBootstrapTokensToGetNodes(0xc0000dd530?, {0x1f4b5d8, 0xc0011376e0})
2023-06-29T19:38:59.268448472+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/workload_cluster_rbac.go:71 +0x1aa fp=0xc000ea9210 sp=0xc000ea91b8 pc=0x182c56a
2023-06-29T19:38:59.26849179+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).reconcile(0xc0001a1d60, {0x1f4b5d8, 0xc0011376e0}, 0xc00099f860, 0xc00060a700)
2023-06-29T19:38:59.268502139+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:421 +0x168a fp=0xc000ea9958 sp=0xc000ea9210 pc=0x183540a
2023-06-29T19:38:59.268565527+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).Reconcile(0xc0001a1d60, {0x1f4b5d8, 0xc001137530}, {{{0xc000be5416?, 0x10?}, {0xc000478030?, 0x40dc07?}}})
2023-06-29T19:38:59.268572212+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:233 +0x7f0 fp=0xc000ea9cc0 sp=0xc000ea9958 pc=0x1832eb0
2023-06-29T19:38:59.268598232+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1f4b5d8?, {0x1f4b5d8?, 0xc001137530?}, {{{0xc000be5416?, 0x1961da0?}, {0xc000478030?, 0x0?}}})
2023-06-29T19:38:59.268613358+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122 +0xc8 fp=0xc000ea9d60 sp=0xc000ea9cc0 pc=0x14d4128
2023-06-29T19:38:59.268647867+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001a1e00, {0x1f4b530, 0xc000050a40}, {0x1a71d80?, 0xc001144a00?})
2023-06-29T19:38:59.268664604+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323 +0x38f fp=0xc000ea9ee0 sp=0xc000ea9d60 pc=0x14d628f
2023-06-29T19:38:59.268680413+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001a1e00, {0x1f4b530, 0xc000050a40})
2023-06-29T19:38:59.268699436+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274 +0x1d9 fp=0xc000ea9f80 sp=0xc000ea9ee0 pc=0x14d5a79
2023-06-29T19:38:59.268713362+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
2023-06-29T19:38:59.268724902+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235 +0x85 fp=0xc000ea9fe0 sp=0xc000ea9f80 pc=0x14d54c5
2023-06-29T19:38:59.268733815+08:00 stderr F runtime.goexit()
2023-06-29T19:38:59.268746223+08:00 stderr F    runtime/asm_amd64.s:1594 +0x1 fp=0xc000ea9fe8 sp=0xc000ea9fe0 pc=0x46ddc1
2023-06-29T19:38:59.268763535+08:00 stderr F created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
2023-06-29T19:38:59.268769063+08:00 stderr F    sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:231 +0x333

While analyzing this problem, I found that KCP obtains the workload cluster client through ClusterCacheTracker.GetClient. When this client is used to get a resource for the first time, it starts an Informer to cache that resource type, and the Get request does not return until informer.HasSynced is true.

// Get will create a new Informer and add it to the map of specificInformersMap if none exists.  Returns
// the Informer from the map.
func (ip *specificInformersMap) Get(ctx context.Context, gvk schema.GroupVersionKind, obj runtime.Object) (bool, *MapEntry, error) {
    // Return the informer if it is found
    i, started, ok := func() (*MapEntry, bool, bool) {
        ip.mu.RLock()
        defer ip.mu.RUnlock()
        i, ok := ip.informersByGVK[gvk]
        return i, ip.started, ok
    }()

    if !ok {
        var err error
        if i, started, err = ip.addInformerToMap(gvk, obj); err != nil {
            return started, nil, err
        }
    }

    if started && !i.Informer.HasSynced() {
        // Wait for it to sync before returning the Informer so that folks don't read from a stale cache.
        if !cache.WaitForCacheSync(ctx.Done(), i.Informer.HasSynced) {
            return started, nil, apierrors.NewTimeoutError(fmt.Sprintf("failed waiting for %T Informer to sync", obj), 0)
        }
    }

    return started, i, nil
}

When the workload cluster API Server is unreachable, the Get request is blocked because informer.HasSynced is false. And after ClusterCacheTracker.healthCheckCluster fails, the informer cache no longer tries to synchronize, so informer.HasSynced stays false forever.

https://github.com/kubernetes-sigs/cluster-api/blob/14b88ca091102b72c591cb0357199834385df478/controllers/remote/cluster_cache_tracker.go#L527-L605

https://github.com/kubernetes-sigs/cluster-api/blob/14b88ca091102b72c591cb0357199834385df478/controllers/remote/cluster_cache_tracker.go#L419-L435

After that, even if the workload cluster API Server becomes reachable again, the Get request stays blocked, which causes the KCP reconcile to hang.
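To make the blocking concrete, here is a minimal, self-contained sketch (illustrative only, not CAPI code; the stop channel and sync function are stand-ins): cache.WaitForCacheSync returns only when HasSynced flips to true or the stop channel closes. In the hang above, the reconcile context carries no deadline and the stopped remote cache never syncs, so neither ever happens.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

func main() {
	// Stand-in for ctx.Done() of a reconcile context without a deadline.
	stop := make(chan struct{})

	// Stand-in for informer.HasSynced of a cache that was stopped before it
	// ever synced: it will never report true.
	neverSynced := cache.InformerSynced(func() bool { return false })

	// Close the stop channel after 3 seconds so this demo terminates.
	// In the real hang nothing ever closes it, so WaitForCacheSync (and the
	// Get call sitting on top of it) blocks forever.
	time.AfterFunc(3*time.Second, func() { close(stop) })

	start := time.Now()
	synced := cache.WaitForCacheSync(stop, neverSynced)
	fmt.Printf("returned after %s, synced=%v\n", time.Since(start), synced) // synced=false
}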

What did you expect to happen?

When the workload cluster API server is accessible, KCP can reconcile normally.

Cluster API version

v1.4.3

Kubernetes version

v1.24.13

Anything else you would like to add?

This issue can be easily reproduced by the following method.

  1. Pick this commit https://github.com/smartxworks/cluster-api/commit/199100917c3b50b4bf3765b566dc406703ca4875 and update the KCP Deployment.
  2. When KCP first prints the log "start wait for bug", inject a fault to ensure that the workload cluster apiserver is unreachable.
  3. When KCP prints the log "remote/ClusterCacheTracker: Error health checking cluster", the problem is reproduced. Even if the workload cluster apiserver becomes accessible later, KCP will not reconcile.

Label(s) to be applied

/kind bug

killianmuldoon commented 1 year ago

/triage accepted

I'll take a look at reproducing this. Thanks for reporting, and thanks for supplying clear steps to reproduce this issue.

Levi080513 commented 1 year ago

@killianmuldoon Have you tried to reproduce the problem?

jessehu commented 1 year ago

@killianmuldoon we can easily reproduce this issue in a cluster with a single control plane machine by injecting a fault that makes the workload cluster apiserver unreachable.

Levi080513 commented 1 year ago

This problem does not seem to reproduce with the latest code.
When the workload cluster API Server is unreachable, a ClusterCacheTracker.healthCheckCluster failure no longer stops the cache. Is this expected? Maybe it has something to do with this change? https://github.com/kubernetes-sigs/cluster-api/blob/7399f5fc3021a76814d6126eff8eff49e520f61f/controllers/remote/cluster_cache_tracker.go#L654

sbueringer commented 1 year ago

Hm, no. This code just ignores errors if the cache was explicitly shut down, but we only do that after Cluster deletion or when we couldn't sync in createClient.

Levi080513 commented 1 year ago

I added log printing (https://github.com/smartxworks/cluster-api/commit/2f1d311ef166477dc57fce3b2802b08f4c2340f1) and retested this scenario. When unhealthyCount >= in.unhealthyThreshold, the error looks like this:

Get "https://10.255.26.34:6443/?timeout=5s": context deadline exceeded
true
Get "https://10.255.26.34:6443/?timeout=5s": context deadline exceeded
true

It looks like the error is indeed being ignored. This error should be redefined. https://github.com/kubernetes-sigs/cluster-api/blob/7399f5fc3021a76814d6126eff8eff49e520f61f/controllers/remote/cluster_cache_tracker.go#L643

sbueringer commented 1 year ago

Thx! I think now I understand it.

The problem is that the wait utils in k8s.io/apimachinery were changed with v0.27 (we picked it up with CR v0.15):

// ErrWaitTimeout is returned when the condition was not satisfied in time.
//
// Deprecated: This type will be made private in favor of Interrupted()
// for checking errors or ErrorInterrupted(err) for returning a wrapped error.
var ErrWaitTimeout = ErrorInterrupted(errors.New("timed out waiting for the condition"))

Previously we just checked for this error. But now we use the util func they recommended, and that one captures more:

func Interrupted(err error) bool {
    switch {
    case errors.Is(err, errWaitTimeout),
        errors.Is(err, context.Canceled),
        errors.Is(err, context.DeadlineExceeded):
        return true
    default:
        return false
    }
}
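The "true" lines in the debug output above can be reproduced with a few lines, assuming k8s.io/apimachinery v0.27+; the wrapped error below mimics the Get "...?timeout=5s": context deadline exceeded returned by the failed health probe, and Interrupted treats it the same as an interrupted poll:

package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// The kind of error the health probe returns when the API server is unreachable.
	probeErr := fmt.Errorf("Get %q: %w", "https://10.255.26.34:6443/?timeout=5s", context.DeadlineExceeded)

	// Old check: only the exact ErrWaitTimeout sentinel matched, so a real
	// health-check failure was still reported.
	fmt.Println(probeErr == wait.ErrWaitTimeout) // false

	// New check: also matches context.DeadlineExceeded, so the genuine failure
	// is indistinguishable from the poll simply being interrupted and is ignored.
	fmt.Println(wait.Interrupted(probeErr)) // true
}
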
sbueringer commented 1 year ago

But it doesn't make any sense because this change is only in main & release-1.5. It doesn't exist on release-1.4

sbueringer commented 1 year ago

I wonder if we should just always delete the accessor

sbueringer commented 1 year ago

For release-1.5 and main, Interrupted is definitely the problem: (screenshot)

sbueringer commented 1 year ago

This should fix it for main and release-1.5 https://github.com/kubernetes-sigs/cluster-api/pull/9025

I'll take a look at release-1.4 now

sbueringer commented 1 year ago

@Levi080513 @jessehu Just to get as much data as possible. With which versions of CAPI did you encounter this issue? (just so I know on which branches I should investigate this)

Levi080513 commented 1 year ago

v1.4.3, with the lazyRestMapper feature disabled.

sbueringer commented 1 year ago

Okay so back to your issue on release-1.4. I think I know what's going on.

Initial client creation:

APIserver becomes unreachable

In a Reconcile after APIServer becomes unreachable

I think at this point it doesn't matter if the health check deletes the clusterAccessor / client / cache, as we'll just stay stuck. The cacheCtx we cancel when stopping the cache has no impact on the Get call, as it is not used there (potential idea for a fix).

(screenshot: stuck in WaitForCacheSync with no timeout in the ctx)

Levi080513 commented 1 year ago

Yes, what you said is correct. One fix is to cancel all Get/List requests when the cacheCtx is closed.
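As a rough illustration of that idea only (a sketch under assumptions, not necessarily how the actual fix is implemented; combineContexts and getWithCacheLifetime are hypothetical helpers), the context passed to the cached reader could be tied to the cache's lifetime, so that stopping the cache cancels any Get/List still waiting in WaitForCacheSync:

package remote

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// combineContexts returns a context that is canceled as soon as either the
// per-reconcile ctx or the cache's lifetime context (cacheCtx) is done.
func combineContexts(ctx, cacheCtx context.Context) (context.Context, context.CancelFunc) {
	combined, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-cacheCtx.Done(): // cache stopped, e.g. by the health check
			cancel()
		case <-combined.Done(): // reconcile finished or was canceled
		}
	}()
	return combined, cancel
}

// getWithCacheLifetime performs a cached Get with a context bound to the cache
// lifetime: if the health check stops the cache, cacheCtx is canceled, the
// combined context is canceled, and WaitForCacheSync inside the cached client
// returns instead of blocking forever.
func getWithCacheLifetime(ctx, cacheCtx context.Context, reader client.Reader, key client.ObjectKey, obj client.Object) error {
	readCtx, cancel := combineContexts(ctx, cacheCtx)
	defer cancel()
	return reader.Get(readCtx, key, obj)
}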

Levi080513 commented 1 year ago

Can I try to fix this? @sbueringer

sbueringer commented 1 year ago

Oh sorry. Already working on a fix. PR should be up in a bit

(want to get this up / reviewed / merged ASAP to get it into the release & patch releases on Tuesday)

sbueringer commented 1 year ago

@jessehu @Levi080513 It would be super helpful if you can take a look at https://github.com/kubernetes-sigs/cluster-api/pull/9027 and potentially verify if the fix works for you (it works for me)

jessehu commented 1 year ago

Thanks a lot @sbueringer. We verified the PR works as expected.

sbueringer commented 1 year ago

We'll cherry-pick both fixes. So with the next series of releases (next Tuesday) all should be fixed