syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner: the best way to manage Kubernetes clusters on Hetzner, fully declarative, Kubernetes-native, and with self-healing capabilities
https://caph.syself.com
Apache License 2.0

Control Plane node is not responding #1434

Open Stringls opened 1 month ago

Stringls commented 1 month ago

/kind bug

What steps did you take and what happened:

Firstly, I'd like to share our setup specifications. We have been running a hybrid Hetzner cluster (one control plane on an Hcloud VM, one MachineDeployment on a Hetzner bare metal server) for two years. CAPI Hetzner does not support private networks, so the cluster operates over the public network. Following your advice, we use Cilium host firewall policies to restrict ingress/egress on the nodes. We only open port 6443 to the world entity so that the Hetzner LB can reach the API server on the control plane server. That means no etcd ports are open, so we cannot add more control planes to the setup after the Cilium policies are applied.
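
For illustration, the policy we apply is along these lines; this is a minimal sketch rather than our exact manifest, and the policy name and node label here are placeholders:

```bash
# Minimal sketch of a Cilium host firewall policy of the kind described above.
# Policy name and node label are placeholders, not our exact manifest.
kubectl apply -f - <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: control-plane-apiserver-only   # placeholder name
spec:
  nodeSelector:                         # host policy: selects the node itself
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
  # Let the Hetzner LB (world entity) reach the API server on 6443.
  - fromEntities:
    - world
    toPorts:
    - ports:
      - port: "6443"
        protocol: TCP
  # Keep in-cluster traffic working; anything not listed (e.g. etcd 2379/2380
  # from outside) is denied once the host policy selects the node.
  - fromEntities:
    - cluster
EOF
```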

Now, about the issue we are currently facing. A few weeks ago the cluster started becoming unreachable, and kubectl commands fail with either TLS handshake timeout or EOF errors. When we check the LB status it is unhealthy, because the control plane is not responding to the LB requests, and we also cannot SSH into the CP node. A hot fix is to hard reboot the VM via the Hcloud Console, but then the issue happens again.
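
(For completeness: the same hard reset we trigger from the Hcloud Console can also be done with the hcloud CLI; the server name below is a placeholder.)

```bash
# Hard reset the control plane VM, equivalent to the hard reboot in the Hcloud Console.
# <control-plane-server-name> is a placeholder.
hcloud server reset <control-plane-server-name>
```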

We have some logs from the kube-apiserver and etcd containers on the CP node, retrieved with the crictl logs command.
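
Roughly, they were collected on the node like this (container IDs differ per node):

```bash
# Find the (possibly exited) kube-apiserver and etcd containers, then dump their logs.
crictl ps -a --name kube-apiserver
crictl ps -a --name etcd
crictl logs <container-id>
```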

API server logs

E0812 11:18:38.066456       1 controller.go:131] Unable to remove endpoints from kubernetes service: Get "https://[::1]:6443/api/v1/namespaces/default/endpoints/kubernetes": dial tcp [::1]:6443: connect: connection refused
I0812 11:18:39.373962       1 trace.go:236] Trace[569378699]: "List" accept:application/json,audit-id:9488b352-92b5-4a61-ae39-e77cf5b0d8c8,client:65.109.93.48,protocol:HTTP/2.0,resource:pipelineruns,scope:namespace,url:/apis/tekton.dev/v1/namespaces/pipelines-as-code/pipelineruns/,user-agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36,verb:LIST (12-Aug-2024 11:18:32.489) (total time: 6883ms):
Trace[569378699]: ["List(recursive=true) etcd3" audit-id:9488b352-92b5-4a61-ae39-e77cf5b0d8c8,key:/tekton.dev/pipelineruns/pipelines-as-code,resourceVersion:,resourceVersionMatch:,limit:0,continue: 6883ms (11:18:32.490)]
Trace[569378699]: ---"Writing http response done" count:260 1985ms (11:18:39.373)
Trace[569378699]: [6.883934418s] [6.883934418s] END
I0812 11:18:39.378949       1 controller.go:159] Shutting down quota evaluator
I0812 11:18:39.379319       1 controller.go:178] quota evaluator worker shutdown
I0812 11:18:39.379662       1 controller.go:178] quota evaluator worker shutdown
I0812 11:18:39.379732       1 controller.go:178] quota evaluator worker shutdown
I0812 11:18:39.379767       1 controller.go:178] quota evaluator worker shutdown
I0812 11:18:39.379814       1 controller.go:178] quota evaluator worker shutdown
...
I0812 11:18:35.846631       1 trace.go:236] Trace[673438405]: "List" accept:application/json, */*,audit-id:81e86f77-9334-4e5f-956c-19700a05f0a1,client:65.109.93.48,protocol:HTTP/2.0,resource:tektoninstallersets,scope:cluster,url:/apis/operator.tekton.dev/v1alpha1/tektoninstallersets,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format,verb:LIST (12-Aug-2024 11:18:33.779) (total time: 2067ms):
Trace[673438405]: ["List(recursive=true) etcd3" audit-id:81e86f77-9334-4e5f-956c-19700a05f0a1,key:/operator.tekton.dev/tektoninstallersets,resourceVersion:,resourceVersionMatch:,limit:0,continue: 2067ms (11:18:33.779)]
Trace[673438405]: [2.067348225s] [2.067348225s] END
W0812 11:18:36.544191       1 controller.go:141] RemoveEndpoints() timed out
I0812 11:18:36.563430       1 dynamic_cafile_content.go:171] "Shutting down controller" name="request-header::/etc/kubernetes/pki/front-proxy-ca.crt"
I0812 11:18:36.572312       1 storage_flowcontrol.go:187] APF bootstrap ensurer is exiting
I0812 11:18:36.572373       1 cluster_authentication_trust_controller.go:463] Shutting down cluster_authentication_trust_controller controller
I0812 11:18:36.575736       1 object_count_tracker.go:151] "StorageObjectCountTracker pruner is exiting"
I0812 11:18:36.577015       1 controller.go:115] Shutting down OpenAPI V3 controller
I0812 11:18:36.578914       1 autoregister_controller.go:165] Shutting down autoregister controller
I0812 11:18:36.579889       1 crd_finalizer.go:278] Shutting down CRDFinalizer
I0812 11:18:36.579927       1 apiapproval_controller.go:198] Shutting down KubernetesAPIApprovalPolicyConformantConditionController
I0812 11:18:36.580967       1 nonstructuralschema_controller.go:204] Shutting down NonStructuralSchemaConditionController
I0812 11:18:36.581513       1 establishing_controller.go:87] Shutting down EstablishingController
I0812 11:18:36.581537       1 naming_controller.go:302] Shutting down NamingConditionController
I0812 11:18:36.581563       1 controller.go:162] Shutting down OpenAPI controller
I0812 11:18:36.581593       1 crdregistration_controller.go:142] Shutting down crd-autoregister controller
I0812 11:18:36.581612       1 customresource_discovery_controller.go:325] Shutting down DiscoveryController
I0812 11:18:36.582020       1 available_controller.go:439] Shutting down AvailableConditionController
I0812 11:18:36.582597       1 apf_controller.go:386] Shutting down API Priority and Fairness config worker
I0812 11:18:36.582637       1 system_namespaces_controller.go:77] Shutting down system namespaces controller
I0812 11:18:36.583010       1 apiservice_controller.go:131] Shutting down APIServiceRegistrationController
I0812 11:18:36.584231       1 gc_controller.go:91] Shutting down apiserver lease garbage collector
I0812 11:18:36.586173       1 controller.go:129] Ending legacy_token_tracking_controller
I0812 11:18:36.586217       1 controller.go:130] Shutting down legacy_token_tracking_controller
I0812 11:18:36.586244       1 gc_controller.go:91] Shutting down apiserver lease garbage collector
I0812 11:18:36.587684       1 dynamic_serving_content.go:146] "Shutting down controller" name="aggregator-proxy-cert::/etc/kubernetes/pki/front-proxy-client.crt::/etc/kubernetes/pki/front-proxy-client.key"
I0812 11:18:36.589797       1 dynamic_cafile_content.go:171] "Shutting down controller" name="request-header::/etc/kubernetes/pki/front-proxy-ca.crt"
I0812 11:18:36.590708       1 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
I0812 11:18:36.590766       1 controller.go:86] Shutting down OpenAPI V3 AggregationController
I0812 11:18:36.590796       1 controller.go:84] Shutting down OpenAPI AggregationController
I0812 11:18:36.594984       1 secure_serving.go:258] Stopped listening on [::]:6443
I0812 11:18:36.595581       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0812 11:18:36.596500       1 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
I0812 11:18:36.598584       1 dynamic_serving_content.go:146] "Shutting down controller" name="serving-cert::/etc/kubernetes/pki/apiserver.crt::/etc/kubernetes/pki/apiserver.key"
I0812 11:18:37.033165       1 trace.go:236] Trace[204076444]: "Get" accept:application/json, */*,audit-id:6017e60b-08c9-4962-a66d-91845975cd70,client:...,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/tekton-operator/leases/controller.tektonresolverframework.git.00-of-01,user-agent:resolvers/v0.0.0 (linux/amd64) kubernetes/$Format,verb:GET (12-Aug-2024 11:18:35.538) (total time: 1492ms):

ETCD logs

{"level":"warn","ts":"2024-08-12T11:03:57.379342Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-08-12T11:03:56.788653Z","time spent":"590.660665ms","remote":"127.0.0.1:47576","response type":"/etcdserverpb.KV/Txn","request count":1,"request size":505,"response count":0,"response size":43,"request content":"compare:<target:MOD key:\"/registry/leases/kube-system/cilium-operator-resource-lock\" mod_revision:16152659 > success:<request_put:<key:\"/registry/leases/kube-system/cilium-operator-resource-lock\" value_size:439 >> failure:<request_range:<key:\"/registry/leases/kube-system/cilium-operator-resource-lock\" > >"}
{"level":"warn","ts":"2024-08-12T11:03:57.379475Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"173.348569ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/tekton.dev/tasks/\" range_end:\"/registry/tekton.dev/tasks0\" count_only:true ","response":"range_response_count:0 size:9"}
...
{"level":"warn","ts":"2024-08-12T11:06:43.28665Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"005291459456ea11","error":"etcdserver: request timed out"}
{"level":"info","ts":"2024-08-12T11:06:50.426133Z","caller":"traceutil/trace.go:171","msg":"trace[1752608223] range","detail":"{range_begin:; range_end:; response_count:0; response_revision:16154455; }","duration":"9.211085021s","start":"2024-08-12T11:06:36.762523Z","end":"2024-08-12T11:06:45.973608Z","steps":["trace[1752608223] 'get authentication metadata'  (duration: 973.526291ms)"],"step_count":1}
{"level":"warn","ts":"2024-08-12T11:07:23.340228Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"005291459456ea11","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2024-08-12T11:09:39.806064Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594554187","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2024-08-12T11:12:28.134091Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"005291459455656d","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-08-12T11:12:44.724112Z","caller":"wal/wal.go:805","msg":"slow fdatasync","took":"4.557553205s","expected-duration":"1s"}
{"level":"warn","ts":"2024-08-12T11:12:45.399681Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594555f26","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-08-12T11:13:03.172379Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594554187","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-08-12T11:13:05.409878Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594555290","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-08-12T11:13:42.699339Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594554a1e","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-08-12T11:13:32.719716Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"005291459455656d","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-08-12T11:13:36.006056Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594554a1e","error":"etcdserver: request timed out"}
...
{"level":"warn","ts":"2024-08-12T11:18:15.116831Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594555290","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2024-08-12T11:18:15.243851Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"109.05297ms","expected-duration":"100ms","prefix":"","request":"header:<ID:23240676117768790 > lease_revoke:<id:0052914594554a1e>","response":"size:30"}
{"level":"warn","ts":"2024-08-12T11:18:15.257882Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"17.153µs","request":"header:<ID:23240676117768816 > lease_revoke:<id:0052914594555290>","response":"size:30","error":"lease not found"}
{"level":"warn","ts":"2024-08-12T11:18:15.258519Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"133.465µs","request":"header:<ID:23240676117768882 > lease_revoke:<id:0052914594555290>","response":"size:30","error":"lease not found"}
{"level":"warn","ts":"2024-08-12T11:18:15.258608Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"7.644µs","request":"header:<ID:23240676117768834 > lease_revoke:<id:005291459455656d>","response":"size:30","error":"lease not found"}
...
{"level":"warn","ts":"2024-08-12T11:18:15.277431Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594554187","error":"lease not found"}
{"level":"warn","ts":"2024-08-12T11:18:15.277979Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"00529145945580e7","error":"lease not found"}
{"level":"warn","ts":"2024-08-12T11:18:15.278069Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"0052914594554a1e","error":"lease not found"}
{"level":"info","ts":"2024-08-12T11:18:15.243003Z","caller":"osutil/interrupt_unix.go:64","msg":"received signal; shutting down","signal":"terminated"}
{"level":"info","ts":"2024-08-12T11:18:15.28148Z","caller":"embed/etcd.go:376","msg":"closing etcd server","name":"capi-sys-cluster-control-plane-r8kvf","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://...:2380/"],"advertise-client-urls":["https://...:2379/"]}
{"level":"warn","ts":"2024-08-12T11:18:15.312243Z","caller":"embed/serve.go:212","msg":"stopping secure grpc server due to error","error":"accept tcp 127.0.0.1:2379: use of closed network connection"}
{"level":"warn","ts":"2024-08-12T11:18:15.324532Z","caller":"embed/serve.go:214","msg":"stopped secure grpc server due to error","error":"accept tcp 127.0.0.1:2379: use of closed network connection"}
{"level":"warn","ts":"2024-08-12T11:18:15.333331Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:35042","server-name":"","error":"read tcp 127.0.0.1:2379->127.0.0.1:35042: use of closed network connection"}
{"level":"warn","ts":"2024-08-12T11:18:15.333481Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:35018","server-name":"","error":"read tcp 127.0.0.1:2379->127.0.0.1:35018: use of closed network connection"}
{"level":"warn","ts":"2024-08-12T11:18:15.333639Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:35026","server-name":"","error":"read tcp 127.0.0.1:2379->127.0.0.1:35026: use of closed network connection"}
{"level":"warn","ts":"2024-08-12T11:18:15.333686Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:35040","server-name":"","error":"read tcp 127.0.0.1:2379->127.0.0.1:35040: use of closed network connection"}
{"level":"warn","ts":"2024-08-12T11:18:16.49205Z","caller":"embed/serve.go:212","msg":"stopping secure grpc server due to error","error":"accept tcp 49.12.210.254:2379: use of closed network connection"}
{"level":"warn","ts":"2024-08-12T11:18:16.495441Z","caller":"embed/serve.go:214","msg":"stopped secure grpc server due to error","error":"accept tcp 49.12.210.254:2379: use of closed network connection"}
{"level":"info","ts":"2024-08-12T11:18:16.500534Z","caller":"etcdserver/server.go:1465","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"ed87790b8470052","current-leader-member-id":"ed87790b8470052"}
{"level":"info","ts":"2024-08-12T11:18:16.51348Z","caller":"embed/etcd.go:579","msg":"stopping serving peer traffic","address":"49.12.210.254:2380"}
{"level":"info","ts":"2024-08-12T11:18:16.516225Z","caller":"embed/etcd.go:584","msg":"stopped serving peer traffic","address":"49.12.210.254:2380"}
{"level":"info","ts":"2024-08-12T11:18:16.516603Z","caller":"embed/etcd.go:378","msg":"closed etcd server","name":"capi-sys-cluster-control-plane-r8kvf","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://...:2380/"],"advertise-client-urls":["https://...:2379/"]}

There are a few other logs that we captured the first time this happened to us.

API server logs

W0801 08:11:33.863705       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: operation was canceled"
W0801 08:11:33.867945       1 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: operation was canceled"
E0801 08:11:33.875337       1 oidc.go:335] oidc authenticator: initializing plugin: Get "https://.../.well-known/openid-configuration": dial tcp: lookup ... on [2606:...:....::1111]:53: dial udp [2606:....:...::1111]:53: connect: network is unreachable
F0801 08:11:33.883805       1 instance.go:291] Error creating leases: error creating storage factory: context deadline exceeded
W0801 08:11:33.883892       1 logging.go:59] [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: operation was canceled"

ETCD server logs

2024/08/01 08:11:10 WARNING: [core] [Channel #3 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: i/o timeout"
2024/08/01 08:11:12 WARNING: [core] [Channel #4 SubChannel #5] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: i/o timeout"
2024/08/01 08:12:14 WARNING: [core] [Channel #3 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: i/o timeout"
I0801 08:16:42.467655    3035 logs.go:387] "Incomplete line in log file" path="/var/log/pods/kube-system_etcd-..._f66fcbea99384b2bf26df262a2802915/etcd/13.log" line="\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\

What did you expect to happen:

We'd like to see the hybrid cluster operational again, as we are used to, with no issues on the control plane side.

Anything else you would like to add:

As I said, the cluster was operational with the same set of Cilium policies for a long time, so we are not quite sure the Cilium policies are the cause. That's why we decided to open an issue here to get assistance from you. Thank you.

If you need any other information (that we are allowed to share), we'd be happy to provide it.

Environment:

guettli commented 1 month ago

cluster-api-provider-hetzner version: v1.0.0-beta.37

We released beta.41 a few days ago. Please update; the issue may already be fixed there.
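
A rough sketch of the usual upgrade path with clusterctl (the exact provider/version string and namespace depend on how the management cluster was initialized):

```bash
# Show which provider versions clusterctl would upgrade to.
clusterctl upgrade plan

# Upgrade only the Hetzner infrastructure provider; the version string here is an
# assumption based on the comment above, and a namespace prefix may be required
# (e.g. <namespace>/hetzner:<version>).
clusterctl upgrade apply --infrastructure hetzner:v1.0.0-beta.41
```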