techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

TASK [k3s_agent : Enable and check K3s service] hanging forever #371

Closed · AshDevFr closed this issue 1 year ago

AshDevFr commented 1 year ago

Expected Behavior

The playbook proceeds past this task.

Current Behavior

When running the playbook, it hangs at the k3s_agent : Enable and check K3s service task.

I have tried resetting and re-running several times, without success.

I checked the discussion here, but it did not fix my issue.

If I use my master node's IP address for the endpoint, as FrostyFitz suggested in the discussion, it works, but if I put any other address there it does not. It's almost as if the VIP address is never created: it does not respond to ping. I've checked all the nodes and eth0 exists on every one of them, and my token is correct.

I tried both using the same network for the virtual IP and the MetalLB IP range (10.193.1.1/24) and using a different network (10.193.20.1/24) to have more IPs available, but the result is the same.
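
To see whether kube-vip ever brings the VIP up, I have been running checks like the following on the master nodes (a minimal sanity check, assuming the VIP is 10.193.20.10 on eth0 as in my variables below; the grep pattern for the kube-vip pod is just a guess):

# on each master: is the VIP bound to eth0?
ip addr show eth0 | grep 10.193.20.10

# does the VIP answer at all?
ping -c 3 10.193.20.10

# is the kube-vip static pod running? (uses the kubectl bundled with k3s)
sudo k3s kubectl -n kube-system get pods -o wide | grep kube-vip

So far the VIP never shows up on any interface.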

Steps to Reproduce

  1. Clone the project
  2. Update variables
  3. Run

Context (variables)

Operating system: Ubuntu 22.04

Hardware: Running 5 VMs on Proxmox, all created with Terraform from a cloud-init template.

Variables Used

all.yml

k3s_version: v1.25.12+k3s1
ansible_user: ubuntu
systemd_dir: /etc/systemd/system

flannel_iface: "eth0"

apiserver_endpoint: "10.193.20.10"

k3s_token: "sKcyohCecVULptzpvatzHrYagPGL4mfN"

extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
extra_agent_args: >-
  {{ extra_args }}

kube_vip_tag_version: "v0.5.12"

metal_lb_speaker_tag_version: "v0.13.9"
metal_lb_controller_tag_version: "v0.13.9"

metal_lb_ip_range: "10.193.20.20-10.193.20.99"

Hosts

host.ini

[master]
10.193.1.[155:157]

[node]
10.193.1.[158:159]

[k3s_cluster:children]
master
node

Logs

On the master node

Sep 24 21:34:05 k3s-1 k3s[3394]: E0924 21:34:05.414503    3394 secret.go:192] Couldn't get secret metallb-system/memberlist: secret "memberlist" not found
Sep 24 21:34:05 k3s-1 k3s[3394]: E0924 21:34:05.415375    3394 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist podName:cefc73a3-6380-43f3-8b55-90c9beeedae1 nodeName:}" failed. No retries permitted until 2023-09-24 21:34:37.415339846 -0600 MDT m=+82.612753266 (durationBeforeRetry 32s). Error: MountVolume.SetUp failed for volume "memberlist" (UniqueName: "kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist") pod "speaker-wxc6m" (UID: "cefc73a3-6380-43f3-8b55-90c9beeedae1") : secret "memberlist" not found
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.514639    3394 shared_informer.go:259] Caches are synced for resource quota
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.535098    3394 shared_informer.go:259] Caches are synced for garbage collector
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.563102    3394 shared_informer.go:259] Caches are synced for resource quota
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.585626    3394 shared_informer.go:259] Caches are synced for garbage collector
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.585698    3394 garbagecollector.go:163] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.586814    3394 trace.go:219] Trace[1377249825]: "Proxy via http_connect protocol over tcp" address:10.42.0.4:10250 (24-Sep-2023 21:34:04.982) (total time: 604ms):
Sep 24 21:34:05 k3s-1 k3s[3394]: Trace[1377249825]: [604.629966ms] [604.629966ms] END
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.587493    3394 trace.go:219] Trace[1971106780]: "Proxy via http_connect protocol over tcp" address:10.42.0.4:10250 (24-Sep-2023 21:34:04.981) (total time: 606ms):
Sep 24 21:34:05 k3s-1 k3s[3394]: Trace[1971106780]: [606.059786ms] [606.059786ms] END
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.587659    3394 trace.go:219] Trace[867571666]: "Proxy via http_connect protocol over tcp" address:10.42.0.4:10250 (24-Sep-2023 21:34:04.982) (total time: 604ms):
Sep 24 21:34:05 k3s-1 k3s[3394]: Trace[867571666]: [604.799982ms] [604.799982ms] END
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.587830    3394 trace.go:219] Trace[1441376887]: "Proxy via http_connect protocol over tcp" address:10.42.0.4:10250 (24-Sep-2023 21:34:04.983) (total time: 604ms):
Sep 24 21:34:05 k3s-1 k3s[3394]: Trace[1441376887]: [604.256031ms] [604.256031ms] END
Sep 24 21:34:05 k3s-1 k3s[3394]: I0924 21:34:05.587994    3394 trace.go:219] Trace[33641745]: "Proxy via http_connect protocol over tcp" address:10.42.0.4:10250 (24-Sep-2023 21:34:04.980) (total time: 607ms):
Sep 24 21:34:05 k3s-1 k3s[3394]: Trace[33641745]: [607.771609ms] [607.771609ms] END
Sep 24 21:34:05 k3s-1 k3s[3394]: E0924 21:34:05.613528    3394 available_controller.go:524] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Sep 24 21:34:05 k3s-1 k3s[3394]: {"level":"info","ts":"2023-09-24T21:34:05.971-0600","caller":"traceutil/trace.go:171","msg":"trace[911155418] transaction","detail":"{read_only:false; response_revision:1311; number_of_response:1; }","duration":"106.54913ms","start":"2023-09-24T21:34:05.865-0600","end":"2023-09-24T21:34:05.971-0600","steps":["trace[911155418] 'process raft request'  (duration: 106.291865ms)"],"step_count":1}
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.294794    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"24e933ea1791164db9c6855da0d8b66bbb263af9857ee8e53d5806b9eb4fdc98\": not found" containerID="24e933ea1791164db9c6855da0d8b66bbb263af9857ee8e53d5806b9eb4fdc98"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.296444    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="24e933ea1791164db9c6855da0d8b66bbb263af9857ee8e53d5806b9eb4fdc98" err="rpc error: code = NotFound desc = an error occurred when try to find container \"24e933ea1791164db9c6855da0d8b66bbb263af9857ee8e53d5806b9eb4fdc98\": not found"
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.303636    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"447e3faf5b21addde533ac153954515a99aad4540b75a7d9cc37af23ef5ea000\": not found" containerID="447e3faf5b21addde533ac153954515a99aad4540b75a7d9cc37af23ef5ea000"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.303760    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="447e3faf5b21addde533ac153954515a99aad4540b75a7d9cc37af23ef5ea000" err="rpc error: code = NotFound desc = an error occurred when try to find container \"447e3faf5b21addde533ac153954515a99aad4540b75a7d9cc37af23ef5ea000\": not found"
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.307061    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"74a24e625656230a759a2fa931b60272977b19bbfc2e02ca25de0ddb7c2a2abd\": not found" containerID="74a24e625656230a759a2fa931b60272977b19bbfc2e02ca25de0ddb7c2a2abd"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.307169    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="74a24e625656230a759a2fa931b60272977b19bbfc2e02ca25de0ddb7c2a2abd" err="rpc error: code = NotFound desc = an error occurred when try to find container \"74a24e625656230a759a2fa931b60272977b19bbfc2e02ca25de0ddb7c2a2abd\": not found"
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.311465    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"fdf5f59e2dde882b711124b9758e5945c513179afdc73ad1e0cd4de071e13026\": not found" containerID="fdf5f59e2dde882b711124b9758e5945c513179afdc73ad1e0cd4de071e13026"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.311577    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="fdf5f59e2dde882b711124b9758e5945c513179afdc73ad1e0cd4de071e13026" err="rpc error: code = NotFound desc = an error occurred when try to find container \"fdf5f59e2dde882b711124b9758e5945c513179afdc73ad1e0cd4de071e13026\": not found"
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.313676    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"81af9130a4196030a89317c7419735a7e79006d886f95e99a42c393f6c797841\": not found" containerID="81af9130a4196030a89317c7419735a7e79006d886f95e99a42c393f6c797841"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.313787    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="81af9130a4196030a89317c7419735a7e79006d886f95e99a42c393f6c797841" err="rpc error: code = NotFound desc = an error occurred when try to find container \"81af9130a4196030a89317c7419735a7e79006d886f95e99a42c393f6c797841\": not found"
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.318267    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"ab0ff68f27bd01fe90e127e3b6760289c9009a70289e131955eebd0ebc47129c\": not found" containerID="ab0ff68f27bd01fe90e127e3b6760289c9009a70289e131955eebd0ebc47129c"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.318388    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="ab0ff68f27bd01fe90e127e3b6760289c9009a70289e131955eebd0ebc47129c" err="rpc error: code = NotFound desc = an error occurred when try to find container \"ab0ff68f27bd01fe90e127e3b6760289c9009a70289e131955eebd0ebc47129c\": not found"
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.319482    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"99864d63eaee7a339c829ef337f2533d87fa477acadd5c04d6087036713ae140\": not found" containerID="99864d63eaee7a339c829ef337f2533d87fa477acadd5c04d6087036713ae140"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.319596    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="99864d63eaee7a339c829ef337f2533d87fa477acadd5c04d6087036713ae140" err="rpc error: code = NotFound desc = an error occurred when try to find container \"99864d63eaee7a339c829ef337f2533d87fa477acadd5c04d6087036713ae140\": not found"
Sep 24 21:34:32 k3s-1 k3s[3394]: E0924 21:34:32.320791    3394 remote_runtime.go:625] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"af2148977c522f9c5276333e5935fb009284c7af616bc9fefd88d882e033d5be\": not found" containerID="af2148977c522f9c5276333e5935fb009284c7af616bc9fefd88d882e033d5be"
Sep 24 21:34:32 k3s-1 k3s[3394]: I0924 21:34:32.320901    3394 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="af2148977c522f9c5276333e5935fb009284c7af616bc9fefd88d882e033d5be" err="rpc error: code = NotFound desc = an error occurred when try to find container \"af2148977c522f9c5276333e5935fb009284c7af616bc9fefd88d882e033d5be\": not found"
Sep 24 21:34:37 k3s-1 k3s[3394]: E0924 21:34:37.447729    3394 secret.go:192] Couldn't get secret metallb-system/memberlist: secret "memberlist" not found
Sep 24 21:34:37 k3s-1 k3s[3394]: E0924 21:34:37.447967    3394 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist podName:cefc73a3-6380-43f3-8b55-90c9beeedae1 nodeName:}" failed. No retries permitted until 2023-09-24 21:35:41.447927032 -0600 MDT m=+146.645340444 (durationBeforeRetry 1m4s). Error: MountVolume.SetUp failed for volume "memberlist" (UniqueName: "kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist") pod "speaker-wxc6m" (UID: "cefc73a3-6380-43f3-8b55-90c9beeedae1") : secret "memberlist" not found
Sep 24 21:35:01 k3s-1 k3s[3394]: E0924 21:35:01.038365    3394 dns.go:157] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.193.1.35 10.193.1.12 10.193.1.35"
Sep 24 21:35:36 k3s-1 k3s[3394]: E0924 21:35:36.572367    3394 kubelet.go:1731] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[memberlist], unattached volumes=[memberlist kube-api-access-tpv7s]: timed out waiting for the condition" pod="metallb-system/speaker-wxc6m"
Sep 24 21:35:36 k3s-1 k3s[3394]: E0924 21:35:36.572508    3394 pod_workers.go:965] "Error syncing pod, skipping" err="unmounted volumes=[memberlist], unattached volumes=[memberlist kube-api-access-tpv7s]: timed out waiting for the condition" pod="metallb-system/speaker-wxc6m" podUID=cefc73a3-6380-43f3-8b55-90c9beeedae1
Sep 24 21:35:41 k3s-1 k3s[3394]: E0924 21:35:41.544516    3394 secret.go:192] Couldn't get secret metallb-system/memberlist: secret "memberlist" not found
Sep 24 21:35:41 k3s-1 k3s[3394]: E0924 21:35:41.546423    3394 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist podName:cefc73a3-6380-43f3-8b55-90c9beeedae1 nodeName:}" failed. No retries permitted until 2023-09-24 21:37:43.546358301 -0600 MDT m=+268.743771731 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "memberlist" (UniqueName: "kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist") pod "speaker-wxc6m" (UID: "cefc73a3-6380-43f3-8b55-90c9beeedae1") : secret "memberlist" not found
Sep 24 21:36:11 k3s-1 k3s[3394]: E0924 21:36:11.038858    3394 dns.go:157] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.193.1.35 10.193.1.12 10.193.1.35"
Sep 24 21:37:43 k3s-1 k3s[3394]: E0924 21:37:43.577589    3394 secret.go:192] Couldn't get secret metallb-system/memberlist: secret "memberlist" not found
Sep 24 21:37:43 k3s-1 k3s[3394]: E0924 21:37:43.577922    3394 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist podName:cefc73a3-6380-43f3-8b55-90c9beeedae1 nodeName:}" failed. No retries permitted until 2023-09-24 21:39:45.577823361 -0600 MDT m=+390.775236858 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "memberlist" (UniqueName: "kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist") pod "speaker-wxc6m" (UID: "cefc73a3-6380-43f3-8b55-90c9beeedae1") : secret "memberlist" not found
Sep 24 21:37:54 k3s-1 k3s[3394]: E0924 21:37:54.040671    3394 kubelet.go:1731] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[memberlist], unattached volumes=[memberlist kube-api-access-tpv7s]: timed out waiting for the condition" pod="metallb-system/speaker-wxc6m"
Sep 24 21:37:54 k3s-1 k3s[3394]: E0924 21:37:54.044689    3394 pod_workers.go:965] "Error syncing pod, skipping" err="unmounted volumes=[memberlist], unattached volumes=[memberlist kube-api-access-tpv7s]: timed out waiting for the condition" pod="metallb-system/speaker-wxc6m" podUID=cefc73a3-6380-43f3-8b55-90c9beeedae1
Sep 24 21:38:22 k3s-1 k3s[3394]: E0924 21:38:22.044813    3394 dns.go:157] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.193.1.35 10.193.1.12 10.193.1.35"
Sep 24 21:39:45 k3s-1 k3s[3394]: E0924 21:39:45.610639    3394 secret.go:192] Couldn't get secret metallb-system/memberlist: secret "memberlist" not found
Sep 24 21:39:45 k3s-1 k3s[3394]: E0924 21:39:45.610944    3394 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist podName:cefc73a3-6380-43f3-8b55-90c9beeedae1 nodeName:}" failed. No retries permitted until 2023-09-24 21:41:47.610881377 -0600 MDT m=+512.808294793 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "memberlist" (UniqueName: "kubernetes.io/secret/cefc73a3-6380-43f3-8b55-90c9beeedae1-memberlist") pod "speaker-wxc6m" (UID: "cefc73a3-6380-43f3-8b55-90c9beeedae1") : secret "memberlist" not found
Sep 24 21:40:11 k3s-1 k3s[3394]: E0924 21:40:11.043026    3394 kubelet.go:1731] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[memberlist], unattached volumes=[kube-api-access-tpv7s memberlist]: timed out waiting for the condition" pod="metallb-system/speaker-wxc6m"
Sep 24 21:40:11 k3s-1 k3s[3394]: E0924 21:40:11.044014    3394 pod_workers.go:965] "Error syncing pod, skipping" err="unmounted volumes=[memberlist], unattached volumes=[kube-api-access-tpv7s memberlist]: timed out waiting for the condition" pod="metallb-system/speaker-wxc6m" podUID=cefc73a3-6380-43f3-8b55-90c9beeedae1
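
The memberlist errors look like the MetalLB speaker just waiting for its secret, so I have also been checking whether it ever appears (run on a master; the namespace and secret name are taken from the log lines above):

# does the memberlist secret exist, and are the MetalLB pods running?
sudo k3s kubectl -n metallb-system get secret memberlist
sudo k3s kubectl -n metallb-system get pods -o wide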

On the worker node

Sep 24 21:08:12 k3s-4 k3s[1483]: time="2023-09-24T21:08:12-06:00" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Sep 24 21:08:12 k3s-4 k3s[1483]: time="2023-09-24T21:08:12-06:00" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/3cdacaf539fc388d8e542a8d643948e3c7bfa4a7e91b7521102325e0ce8581b6"
Sep 24 21:08:17 k3s-4 k3s[1483]: time="2023-09-24T21:08:17-06:00" level=info msg="Starting k3s agent v1.25.12+k3s1 (7515237f)"
Sep 24 21:08:17 k3s-4 k3s[1483]: time="2023-09-24T21:08:17-06:00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.193.20.10:6443"
Sep 24 21:08:17 k3s-4 k3s[1483]: time="2023-09-24T21:08:17-06:00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [10.193.20.10:6443] [default: 10.193.20.10:6443]"
Sep 24 21:08:23 k3s-4 k3s[1483]: time="2023-09-24T21:08:23-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:59380->127.0.0.1:6444: read: connection reset by peer"
Sep 24 21:08:31 k3s-4 k3s[1483]: time="2023-09-24T21:08:31-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:58798->127.0.0.1:6444: read: connection reset by peer"
Sep 24 21:08:39 k3s-4 k3s[1483]: time="2023-09-24T21:08:39-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:50748->127.0.0.1:6444: read: connection reset by peer"
Sep 24 21:08:48 k3s-4 k3s[1483]: time="2023-09-24T21:08:48-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:50776->127.0.0.1:6444: read: connection reset by peer"
Sep 24 21:08:56 k3s-4 k3s[1483]: time="2023-09-24T21:08:56-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:50578->127.0.0.1:6444: read: connection reset by peer"
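
The agent registers the configured endpoint in its load balancer and then cannot fetch the CA certs through it, so as a sanity check I have been verifying from a worker that the endpoint is reachable at all (10.193.20.10 is the apiserver_endpoint from my variables; I would expect /cacerts to return the cluster CA if the VIP were up):

# run on a worker node
ping -c 3 10.193.20.10
curl -vk https://10.193.20.10:6443/cacerts

Both fail for me, which matches the log output above.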

Possible Solution