k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Waiting to retrieve kube-proxy configuration; server is not ready - no pods running #7584

Closed · pkoryzna closed 1 year ago

pkoryzna commented 1 year ago

Environmental Info: K3s Version:

k3s version v1.27.1+k3s1 (bc5b42c2) 
go version go1.20.3

Node(s) CPU architecture, OS, and Version:

Linux debian 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

Cluster Configuration: 1 node

Describe the bug:

None of the workloads (in kube-system or in any of my namespaces) start after a system reboot or systemctl restart k3s.

Steps To Reproduce:

Installed K3s: curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=latest sh -

Expected behavior:

At least the default workloads in kube-system should be running.

Actual behavior:

Nothing starts.

patryk@debian:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                             READY   STATUS      RESTARTS   AGE
kube-system   helm-install-traefik-crd-q7bn4   0/1     Completed   0          318d
kube-system   helm-install-traefik-7jnsj       0/1     Completed   1          318d
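
A quick companion check in this state is the event stream, which usually surfaces scheduling or controller failures:

kubectl get events -A --sort-by='.lastTimestamp'   # all namespaces, most recent events last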

These messages about server not being ready keep appearing in the logs:

INFO[0406] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error
I0519 18:41:52.460033   94158 handler_discovery.go:325] DiscoveryManager: Failed to download discovery for kube-system/metrics-server:443: 503 request timed out
I0519 18:41:52.460199   94158 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
E0519 18:41:52.475328   94158 available_controller.go:531] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.42.0.20:4443/apis/metrics.k8s.io/v1beta1: Get "https://10.42.0.20:4443/apis/metrics.k8s.io/v1beta1": context deadline exceeded
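
The 500 from the supervisor's /v1-k3s/readyz doesn't say which check is failing; since kubectl still reaches the apiserver here, the apiserver's own readiness endpoint (a related but separate check) can give a per-check breakdown:

kubectl get --raw='/readyz?verbose'   # prints each apiserver readiness check and its status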

Additional context / logs: attached k3s server --debug logs: k3s-server.log

brandond commented 1 year ago

Have you added any other CLI flags or config file entries, other than --debug?

Is there anything unusual about this node? Are any of your filesystems on a remote share, ephemeral, or on a transactional update system?
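
For reference, on a default install the places K3s picks up extra configuration are the config file and the systemd unit (including drop-ins), both easy to inspect:

sudo cat /etc/rancher/k3s/config.yaml   # only present if one was created
systemctl cat k3s.service               # shows ExecStart flags plus any drop-in files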

pkoryzna commented 1 year ago

Sorry for not mentioning it - you're correct: I obtained those logs by running sudo k3s server --debug 2>&1 | tee k3s-server.log in hopes of seeing some more detail. Nothing unusual about the node as far as I can tell; the machine is a physical amd64 box. I used to have /var/lib/rancher/k3s mounted on an iSCSI device, but I moved it to the internal SATA SSD a few weeks ago after installing a larger drive, and it has run without any problems since (I commented out the fstab entry after that).
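
Since the install script registers K3s as a systemd service, the same log stream is also available from journald without re-running the server by hand:

journalctl -u k3s --no-pager | tail -n 200   # last 200 lines from the k3s unit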

brandond commented 1 year ago

Did that perhaps not get set up properly after the reboot? Can you confirm that you've got the expected contents and mounts at that path, and that nothing is being mounted there now? It feels very much like the mount is being added halfway through K3s starting up, and a bunch of content is missing.
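
One way to confirm is to ask the kernel and systemd directly what backs that path, for example:

findmnt --target /var/lib/rancher/k3s                  # which filesystem the path actually lives on
systemctl list-units --type=mount | grep -i rancher    # any leftover systemd mount units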

pkoryzna commented 1 year ago

Thank you for the suggestion. Just checked - I can confirm there are no iSCSI mounts on this system anymore; everything is on logical volumes in a VG on sda, which is the SATA SSD inside the machine. The directories in /var/lib/rancher/k3s/storage do correspond to the PVCs I had set up, and the content inside is also what I would expect.

patryk@debian:~$ lsblk
NAME                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                     8:0    0 55.9G  0 disk
|-sda1                  8:1    0  512M  0 part /boot
`-sda2                  8:2    0 55.4G  0 part
  |-debian--vg-root   254:0    0 54.4G  0 lvm  /
  `-debian--vg-swap_1 254:1    0  980M  0 lvm  [SWAP]

patryk@debian:~$ sudo ls /var/lib/rancher/k3s
agent  data  server  storage

patryk@debian:~$ sudo ls /var/lib/rancher/k3s/storage/
pvc-3ff2b39f-a108-405e-8e28-af1a093856af_3dprint_octoprint-vol-octoprint-0
pvc-5d8410e7-719c-4d90-a02e-463f4db4bde6_3dprint_octoprint-vol-octo-octoprint-0

patryk@debian:~$ df /var/lib/rancher/k3s
Filesystem                  1K-blocks     Used Available Use% Mounted on
/dev/mapper/debian--vg-root  56098784 17120284  36625240  32% /

patryk@debian:~$ sudo iscsiadm -m session
iscsiadm: No active sessions.

When I was using the mounted iSCSI volume, I had it added as a dependency of k3s.service in systemd, which I also commented out right after moving the data to the local volume. (I assume systemd wouldn't start the service if it still depended on the mount, and k3s.service itself gets started automatically without any manual intervention on my side.)
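
A drop-in like the one described would typically have looked something like this (a hypothetical reconstruction; the original file is gone):

# /etc/systemd/system/k3s.service.d/override.conf (hypothetical reconstruction)
[Unit]
RequiresMountsFor=/var/lib/rancher/k3s
After=remote-fs.target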

Just to be sure, I removed the drop-ins completely with sudo systemctl revert k3s.service and restarted - still having the same issue and similar error messages in the logs, even without --debug:

May 20 00:05:06 debian k3s[1554]: time="2023-05-20T00:05:06+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
May 20 00:05:07 debian k3s[1554]: time="2023-05-20T00:05:07+02:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
May 20 00:05:11 debian k3s[1554]: time="2023-05-20T00:05:11+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
May 20 00:05:12 debian k3s[1554]: time="2023-05-20T00:05:12+02:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
May 20 00:05:15 debian k3s[1554]: time="2023-05-20T00:05:15+02:00" level=info msg="Waiting for API server to become available"
May 20 00:05:16 debian k3s[1554]: time="2023-05-20T00:05:16+02:00" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
May 20 00:05:16 debian k3s[1554]: time="2023-05-20T00:05:16+02:00" level=info msg="Waiting for API server to become available"
May 20 00:05:17 debian k3s[1554]: time="2023-05-20T00:05:17+02:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
May 20 00:05:18 debian k3s[1554]: I0520 00:05:18.238856    1554 trace.go:219] Trace[1208218320]: "Proxy via http_connect protocol over tcp" address:10.42.0.20:4443 (20-May-2023 00:03:08.434) (total time: 129803ms):
May 20 00:05:18 debian k3s[1554]: Trace[1208218320]: [2m9.80398354s] [2m9.80398354s] END
May 20 00:05:18 debian k3s[1554]: I0520 00:05:18.242742    1554 trace.go:219] Trace[1418202149]: "Proxy via http_connect protocol over tcp" address:10.42.0.20:4443 (20-May-2023 00:03:08.434) (total time: 129808ms):
May 20 00:05:18 debian k3s[1554]: Trace[1418202149]: [2m9.808432896s] [2m9.808432896s] END
May 20 00:05:18 debian k3s[1554]: I0520 00:05:18.242742    1554 trace.go:219] Trace[357151497]: "Proxy via http_connect protocol over tcp" address:10.42.0.20:4443 (20-May-2023 00:03:08.434) (total time: 129807ms):
May 20 00:05:18 debian k3s[1554]: Trace[357151497]: [2m9.807774284s] [2m9.807774284s] END
May 20 00:05:18 debian k3s[1554]: I0520 00:05:18.242937    1554 trace.go:219] Trace[25518149]: "Proxy via http_connect protocol over tcp" address:10.42.0.20:4443 (20-May-2023 00:03:08.434) (total time: 129808ms):
May 20 00:05:18 debian k3s[1554]: Trace[25518149]: [2m9.808585431s] [2m9.808585431s] END
May 20 00:05:18 debian k3s[1554]: I0520 00:05:18.243011    1554 trace.go:219] Trace[714593470]: "Proxy via http_connect protocol over tcp" address:10.42.0.20:4443 (20-May-2023 00:03:08.434) (total time: 129808ms):
May 20 00:05:18 debian k3s[1554]: Trace[714593470]: [2m9.808736988s] [2m9.808736988s] END

Let me know if there's anything else I could check!

brandond commented 1 year ago

Can you run kubectl get addon -A and ls -la /var/lib/rancher/k3s/server/manifests/? You appear to be missing all the packaged components.

pkoryzna commented 1 year ago

Seems like I have some stuff deployed, but nothing actually running 🤔

patryk@debian:~$ kubectl get addon -A
NAMESPACE     NAME                        AGE
kube-system   ccm                         319d
kube-system   coredns                     319d
kube-system   local-storage               319d
kube-system   aggregated-metrics-reader   319d
kube-system   auth-delegator              319d
kube-system   auth-reader                 319d
kube-system   metrics-apiservice          319d
kube-system   metrics-server-deployment   319d
kube-system   metrics-server-service      319d
kube-system   resource-reader             319d
kube-system   rolebindings                319d
kube-system   traefik                     319d

There are manifests at that path, including metrics-server, which somehow doesn't seem to be deployed properly:

patryk@debian:~$ sudo ls -laR /var/lib/rancher/k3s/server/manifests/
/var/lib/rancher/k3s/server/manifests/:
total 36
drwx------ 3 root root 4096 May 19 18:20 .
drwx------ 8 root root 4096 May 20 00:08 ..
-rw------- 1 root root 1774 May 15 22:42 ccm.yaml
-rw------- 1 root root 4857 May 15 22:42 coredns.yaml
-rw------- 1 root root 3635 May 15 22:42 local-storage.yaml
drwx------ 2 root root 4096 Apr  1 14:28 metrics-server
-rw------- 1 root root 1039 May 15 22:42 rolebindings.yaml
-rw------- 1 root root 1155 May 15 22:42 traefik.yaml

/var/lib/rancher/k3s/server/manifests/metrics-server:
total 36
drwx------ 2 root root 4096 Apr  1 14:28 .
drwx------ 3 root root 4096 May 19 18:20 ..
-rw------- 1 root root  393 May 15 22:42 aggregated-metrics-reader.yaml
-rw------- 1 root root  303 May 15 22:42 auth-delegator.yaml
-rw------- 1 root root  324 May 15 22:42 auth-reader.yaml
-rw------- 1 root root  293 May 15 22:42 metrics-apiservice.yaml
-rw------- 1 root root 2217 May 15 22:42 metrics-server-deployment.yaml
-rw------- 1 root root  309 May 15 22:42 metrics-server-service.yaml
-rw------- 1 root root  517 May 15 22:42 resource-reader.yaml
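
One way to dig further into AddOns that exist but have nothing running behind them, assuming the k3s deploy controller is the part that's stuck, is to inspect the AddOn objects and the helm-install jobs directly:

kubectl -n kube-system describe addon traefik   # AddOn objects record each applied manifest
kubectl -n kube-system get jobs                 # the helm-install-* jobs created from HelmChart resources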

brandond commented 1 year ago

Hmm, I'm kind of at a loss. Have you tried stopping k3s, mounting the volume again, and then starting it again to see if perhaps some of the data was missed when you migrated off? I don't see any critical errors but clearly something is missing.
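
A sketch of that sequence, with a placeholder device name since the original iSCSI target would first need to be re-attached:

sudo systemctl stop k3s
sudo mount /dev/sdX /var/lib/rancher/k3s   # hypothetical: the old iSCSI-backed device
sudo systemctl start k3s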

pkoryzna commented 1 year ago

Yeah, this is very confusing indeed. I don't have a copy of the volume anymore; I removed it after (seemingly successfully) migrating, so I can't check.

Tonight I did an apt update && apt upgrade && reboot, removed k3s, reinstalled it from the stable channel, and reinstalled my Helm charts - and everything works as expected. I'm afraid I won't be able to reproduce the issue; my guess would be some sneaky filesystem corruption on the local SSD? Might be a good idea to check the SMART metrics soon 😅
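
For reference, the reinstall plus the SMART check amount to something like this (the uninstall script path is the one the installer creates):

/usr/local/bin/k3s-uninstall.sh                                  # removes k3s and its data
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable sh -   # reinstall from the stable channel
sudo smartctl -a /dev/sda                                        # needs the smartmontools package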