ChipWolf closed this issue 2 years ago.
Removed --no-deploy coredns, same issue. Not sure how to recover from this.
Fortunately it's an IaC GitOps deployment, so recovery isn't a problem; it would be good to figure out what is causing this, though.
I am unable to reproduce this, so I'll close for now.
Same issue again today. While Googling for an answer, I stumbled across my own issue. I did reproduce it, but I have no clue what triggers it.
I ended up having the same issue after an upgrade. But after testing, I realized that after multiple restarts of k3s, it sometimes starts without crashing. All crash logs end with the same line:
F1128 22:13:25.702551 9704 controllermanager.go:262] Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/crd-informer-synced ok\n[+]poststarthook/bootstrap-controller ok\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\n[+]poststarthook/start-cluster-authentication-info-controller ok\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[+]autoregister-completion ok\n[+]poststarthook/apiservice-openapi-controller ok\nhealthz check failed") has prevented the request from succeeding
After disabling cloud-controller-manager by adding --disable-cloud-controller, k3s now starts every time without crashing. So perhaps there is a race condition of some sort: cloud-controller-manager is perhaps starting too early, and the 10-second timeout is too short?
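For reference, one way to apply that flag if you used the upstream install script (a sketch only; adjust to however you launch k3s):

```sh
# Re-run the install script with the extra server flag passed via INSTALL_K3S_EXEC
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --disable-cloud-controller" sh -
```

The same flag can also be appended directly to the ExecStart line of the k3s systemd unit.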
cc @brandond, who I saw mention this in #2477.
I'm observing the same issue with k3s v1.19.7+k3s1 (using the embedded SQLite datastore, FWIW) on an amd64 node (with 4 cores and 8 GB memory). For completeness, posting a snippet of the logs here as well.
Like @ChipWolf notes, I also observe that after several iterations of startup & subsequent crashing, it eventually 'works' and stays running and stable until the next reboot.
Feb 9 18:17:23 k3s-0 k3s[5937]: F0209 18:17:23.032755 5937 controllermanager.go:262] Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server ("[+]ping ok\n[+]logok\n[+]etcd ok\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/max-in-flight-filter ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/crd-informer-synced ok\n[+]poststarthook/bootstrap-controller ok\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\n[+]poststarthook/start-cluster-authentication-info-controller ok\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[+]autoregister-completion ok\n[+]poststarthook/apiservice-openapi-controller ok\nhealthz check failed") has prevented the request from succeeding
@billimek can you attach a complete log from K3s startup onwards? I don't think I've seen this without there being some other issue with datastore latency.
Here we go: https://gist.github.com/billimek/064f85c0d5f83e85ab2feb7c7bcbf9a7
The interesting parts start around log timestamp Feb 09 18:15:45. I include probably too much previous logging but want to give as much context as possible.
The only correlation I've found (it could be purely coincidental) is that if a node normally has relatively high resource utilization, then on reboot it will cap the CPU and crash loop.
Adding this to my k3s server flags, and adding a few more nodes to further distribute the same workload, seems to keep the issue away for now: --kube-apiserver-arg default-not-ready-toleration-seconds=10 --kube-apiserver-arg default-unreachable-toleration-seconds=10.
The caveat with my bandaid is that I'm running a significantly less efficient cluster with underutilized nodes.
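In case it helps anyone copy this, here is roughly what the server invocation looks like with those arguments (a sketch; your other flags will differ):

```sh
# Pass the toleration overrides through to kube-apiserver via k3s
k3s server \
  --kube-apiserver-arg default-not-ready-toleration-seconds=10 \
  --kube-apiserver-arg default-unreachable-toleration-seconds=10
```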
FWIW, this is a summary view of CPU, load, and memory of my node post-reboot (indicated by the arrow) from the logs posted above:
The cloud-controller-manager stuff has been reorganized a bit upstream as of 1.20 - has anyone seen this same issue on any of the 1.20 releases?
I will note that although cloud-controller-manager is what's ultimately blowing up, there is something that's causing the apiserver's rbac role bootstrap (run on every startup to ensure that system-critical RBAC resources exist) to either fail or take too long to start up. Unfortunately, the reason for the failure is masked at the default log level. Would it be possible for someone experiencing this problem to start up K3s with --v=2 added to your k3s server flags?
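Something like this would capture it (a sketch; a systemd unit edit works just as well):

```sh
# Stop the service, then run k3s in the foreground with extra verbosity and save the output
sudo systemctl stop k3s
sudo k3s server --v=2 2>&1 | tee /tmp/k3s-v2.log
```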
Still getting this on k3s v1.20.4+k3s1, running on a Raspberry Pi 4 w/ 4GB. Consistently happens on server restart; if I give it long enough, it will eventually start up. This evening: 35 consecutive restarts and counting...
Log attached, with --v=2 added in /etc/systemd/system/k3s.service.
I continue to see the same thing: something in the core apiserver startup code that ensures all the cluster-critical RBAC exists is taking an excessively long time on hosts with slow CPU and disk IO. On your host, basic datastore operations are taking several seconds:
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.357225 4562 trace.go:205] Trace[1362090208]: "Get" url:/apis/rbac.authorization.k8s.io/v1/clusterroles/cluster-admin,user-agent:k3s/v1.20.4+k3s1 (linux/arm) kubernetes/838a906,client:127.0.0.1 (22-Mar-2021 23:02:22.879) (total time: 1477ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[1362090208]: ---"About to write a response" 1476ms (23:02:00.356)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[1362090208]: [1.477146999s] [1.477146999s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.738373 4562 trace.go:205] Trace[2075772655]: "List etcd3" key:/secrets/media,resourceVersion:,resourceVersionMatch:,limit:0,continue: (22-Mar-2021 23:02:23.184) (total time: 1554ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[2075772655]: [1.554085554s] [1.554085554s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.784312 4562 trace.go:205] Trace[607658570]: "GuaranteedUpdate etcd3" type:*core.Event (22-Mar-2021 23:02:22.937) (total time: 1846ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[607658570]: ---"initial value restored" 1108ms (23:02:00.045)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[607658570]: ---"Transaction committed" 698ms (23:02:00.783)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[607658570]: [1.846476551s] [1.846476551s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.820537 4562 trace.go:205] Trace[132540766]: "Patch" url:/api/v1/namespaces/monitoring/events/monitoring-prometheus-node-exporter-t5fhn.166ec930c9804181,user-agent:k3s/v1.20.4+k3s1 (linux/arm) kubernetes/838a906,client:127.0.0.1 (22-Mar-2021 23:02:22.853) (total time: 1966ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: ---"Recorded the audit event" 83ms (23:02:00.937)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: ---"About to apply patch" 1108ms (23:02:00.045)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: ---"Object stored in database" 699ms (23:02:00.784)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: [1.966605259s] [1.966605259s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.857252 4562 trace.go:205] Trace[360291281]: "List" url:/api/v1/namespaces/media/secrets,user-agent:helm-operator/v0.0.0 (linux/arm) kubernetes/$Format,client:192.168.0.10 (22-Mar-2021 23:02:23.183) (total time: 1673ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[360291281]: ---"Listing from storage done" 1556ms (23:02:00.740)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[360291281]: [1.673183598s] [1.673183598s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: F0322 23:02:24.961119 4562 controllermanager.go:168] error building controller context: failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status:
Well, it is a Raspberry Pi. :-) But then it's been running fine for months; it only started having problems once I moved beyond v1.18. I'm confused that it's referring to etcd: I thought it was using SQLite as the internal DB?
@clambin can you try again with --kube-apiserver-arg=v=2? It doesn't look like the log has anything useful in it specific to the rbac bootstrap health check failure.
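Something along these lines would do it (sketch; the flag can equally go in the systemd unit or config):

```sh
# Bump kube-apiserver's own verbosity by passing v=2 through k3s
sudo k3s server --kube-apiserver-arg=v=2
```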
As far as Kubernetes is concerned, Kine is an etcd server. It just happens to use a SQL backend.
Hmm, seeing two poststarthooks failing to finish here:
I0325 12:45:54.111092 6845 healthz.go:244] poststarthook/start-apiextensions-controllers,poststarthook/crd-informer-synced,poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/scheduling/bootstrap-system-priority-classes,poststarthook/apiservice-registration-controller,autoregister-completion check failed: healthz
[-]poststarthook/start-apiextensions-controllers failed: not finished
[-]poststarthook/crd-informer-synced failed: not finished
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/scheduling/bootstrap-system-priority-classes failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:55.534660 6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:55.899645 6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:56.246184 6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:56.463343 6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:57.400340 6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:57.534746 6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:58.521494 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:58.541441 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:59.526426 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:59.713831 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.038977 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.086097 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.086537 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.538786 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.592778 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.608981 6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
Around the same time, datastore writes are taking multiple seconds to complete:
Trace[1089845604]: [4.141782501s] [4.141782501s] END
Trace[1362040333]: [8.087637975s] [8.087637975s] END
Trace[1065954533]: [3.626552762s] [3.626552762s] END
Trace[190730892]: [5.040242092s] [5.040242092s] END
Trace[1129193395]: [5.041031908s] [5.041031908s] END
Trace[393354043]: [7.635538056s] [7.635538056s] END
Trace[1429448011]: [7.636776825s] [7.636776825s] END
Trace[447841143]: [5.732564436s] [5.732564436s] END
Trace[95356746]: [1.687138997s] [1.687138997s] END
Trace[1113717490]: [1.772819012s] [1.772819012s] END
Trace[2043921204]: [5.817747555s] [5.817747555s] END
Trace[498634661]: [6.820461352s] [6.820461352s] END
Trace[1099537166]: [6.822579694s] [6.822579694s] END
Trace[888477776]: [5.291805666s] [5.291805666s] END
Trace[26378123]: [5.293546498s] [5.293546498s] END
Trace[2019969374]: [5.853455905s] [5.853455905s] END
Trace[478466535]: [5.854393978s] [5.854393978s] END
Trace[294992693]: [5.548645729s] [5.548645729s] END
Trace[1714041199]: [5.549684854s] [5.549684854s] END
I think newer versions of Kubernetes just try to do more on startup, and the additional load on the system causes it to take longer than it's willing to give itself. You might look at testing the IO throughput on your SD card, and picking up a faster one if possible. Switching to USB storage is another option.
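A quick way to get a rough sequential-write number for the card (a sketch only; fio gives more realistic numbers if it's installed):

```sh
# Write ~256MB into the k3s data directory with a final sync, then clean up
dd if=/dev/zero of=/var/lib/rancher/ddtest bs=1M count=256 conv=fdatasync status=progress
rm /var/lib/rancher/ddtest
```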
I have the same issue on a cluster of 3 AWS m5.large EC2 machines. It seems that v1.18.8+k3s1 is the latest version without the issue; I've also tried v1.18.13+k3s1, v1.19.5+k3s2, and v1.20.4+k3s1 without success.
@AlessioCasco I have not seen this issue present on anything except for very low-end arm hardware. Can you please open a new issue with information on your environment, and attach K3s logs from your nodes?
Noticed in #2903 (which has the same error messages) that there's a discussion there on the speed of the SD card. Just for completeness: I switched the master node to booting off a USB-mounted NVMe drive (not as fast as a PCIe-connected NVMe drive, but still faster than the SD card), and I still saw this error.
@brandond FYI, the comment I posted above was from a node running on normal/fast Intel x86-64, in case it matters that this has also been showing up on non-low-end hardware since then.
One more data point: I'm only encountering this issue when restarting the (running) master k3s instance while the agent nodes have running pods. I just had another occurrence of this: I then restarted the agent k3s daemons and the server came up immediately.
I've never seen this happen when the cluster is started. Only when the master k3s is restarted.
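For the record, what I did while the server was crash looping was simply (assuming the default k3s-agent systemd unit on the workers):

```sh
# On each agent node, restart the agent service; the server then came up on its next attempt
sudo systemctl restart k3s-agent
```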
I had this same issue with k3s on a Pi 4; rolling back to v1.18 seemed to solve it.
Edit: v1.18 didn't solve it. It started on the first try but later crashed, which seems consistent with others' experience; I just got lucky on 1.18.
We have been unable to reproduce this locally, and since it's occurring somewhere in the core kubernetes apiserver startup code there may not be anything we can do about it other than opening an issue upstream.
I was able to solve the restart loop by adding the flag --disable-cloud-controller to ExecStart in /etc/systemd/system/k3s.service. This seems to be preserved between restarts and upgrades, and all my pods are running fine. I didn't create an external cloud controller, so for now it's running fine without one at all. I'm sure there's some negative implication here that I'm missing, but for now it seems to meet my needs. Hopefully this helps anyone else running into this problem.
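For anyone wanting to do the same, the edit looks roughly like this (a sketch; your ExecStart will already carry other flags, so append rather than replace):

```sh
# /etc/systemd/system/k3s.service (excerpt) -- append the flag to the existing ExecStart line:
#   ExecStart=/usr/local/bin/k3s server --disable-cloud-controller
# Then reload systemd and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart k3s
```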
I've just had the same issue, the master suddenly went a bit haywire, spewing exceptions into the syslog. @jeremyroberts0 that work-around did the trick, thanks.
So what is the solution? Same problem on version 1.23.
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Environmental Info:
K3s Version: v1.19.2+k3s1 (d38505b1)
Node(s) CPU architecture, OS, and Version: Linux k3os-16698 5.4.0-48-generic #52 SMP Sat Sep 26 08:27:15 UTC 2020 x86_64 GNU/Linux
Cluster Configuration: single-node
Describe the bug:
Shortly after k3s starts, a failure with the apiserver triggers a crash. The log below shows poststarthook/rbac/bootstrap-roles failed: reason withheld; this is followed by a fairly large traceback, and then the service starts again.
Steps To Reproduce:
The last thing to change in the cluster was disabling the deployment of coredns (--no-deploy coredns) in favor of an existing Helm installation of coredns. I don't see a connection between this new failure mode and coredns; however, I'm unable to identify exactly what triggered this failure. I can provide the coredns configuration if required.
Expected behavior:
K3s should start without apiserver health check errors and subsequent crashing.
Actual behavior:
K3s fails to fully instantiate due to the apiserver health check failure and subsequent crashloop.
Additional context / logs:
https://gist.githubusercontent.com/ChipWolf/a3a066e893643a43e2a182e0fa494eab/raw/8cc06511e481ced4809b3a99f96571be67e68e0c/k3s-service.log