k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Crash loop due to apiserver healthcheck error with poststarthook/rbac/bootstrap-roles #2425

Closed · ChipWolf closed this issue 2 years ago

ChipWolf commented 4 years ago

Environmental Info: K3s Version: k3s version v1.19.2+k3s1 (d38505b1)

Node(s) CPU architecture, OS, and Version: Linux k3os-16698 5.4.0-48-generic #52 SMP Sat Sep 26 08:27:15 UTC 2020 x86_64 GNU/Linux

Cluster Configuration: single-node

Describe the bug:

Shortly after k3s starts, an apiserver failure triggers a crash. The log below shows poststarthook/rbac/bootstrap-roles failed: reason withheld; this is followed by a fairly large traceback, and then the service starts again.

Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/crd-informer-synced ok\n[+]poststarthook/bootstrap-controller ok\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\n[+]poststarthook/start-cluster-authentication-info-controller ok\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[+]autoregister-completion ok\n[+]poststarthook/apiservice-openapi-controller ok\nhealthz check failed") has prevented the request from succeeding

Steps To Reproduce:

The last change to the cluster was disabling the bundled coredns deployment (--no-deploy coredns) in favor of an existing Helm installation of coredns. I don't see a connection between this new failure mode and coredns, but I'm unable to identify exactly what triggered the failure. I can provide the coredns configuration if required.
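
Roughly, the change amounted to this (a minimal sketch for illustration; the chart repo/name below are the standard coredns Helm defaults and are assumptions, not the exact values from this cluster):

k3s server --no-deploy coredns
helm repo add coredns https://coredns.github.io/helm
helm install coredns coredns/coredns --namespace kube-system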

Expected behavior:

K3s should start without apiserver health check errors and subsequent crashing.

Actual behavior:

K3s fails to fully instantiate due to the apiserver health check failure and subsequent crashloop.

Additional context / logs:

https://gist.githubusercontent.com/ChipWolf/a3a066e893643a43e2a182e0fa494eab/raw/8cc06511e481ced4809b3a99f96571be67e68e0c/k3s-service.log

ChipWolf commented 4 years ago

Removed --no-deploy coredns; same issue. Not sure how to recover from this. Fortunately it's an IaC GitOps deployment, but it would be good to figure out what is causing this.

ChipWolf commented 4 years ago

I am unable to reproduce this, so I'll close for now.

ChipWolf commented 4 years ago

Same issue again today. While Googling for an answer... I stumbled across my own issue. So I did reproduce it, but I have no clue what triggers it.

radhus commented 3 years ago

I ended up having the same issue after an upgrade. After some testing I realized that, after multiple restarts of k3s, it sometimes starts without crashing. All of the crash logs end with the same line:

F1128 22:13:25.702551    9704 controllermanager.go:262] Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/crd-informer-synced ok\n[+]poststarthook/bootstrap-controller ok\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\n[+]poststarthook/start-cluster-authentication-info-controller ok\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[+]autoregister-completion ok\n[+]poststarthook/apiservice-openapi-controller ok\nhealthz check failed") has prevented the request from succeeding

After disabling the cloud-controller-manager by adding --disable-cloud-controller, k3s now starts every time without crashing. So perhaps there is a race condition of some sort: maybe the cloud-controller-manager starts too early, and the 10-second timeout is too short?
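
For anyone wanting to try the same workaround, a minimal sketch using the k3s config file (this assumes a k3s version recent enough to support /etc/rancher/k3s/config.yaml, where boolean flags map to true/false; editing the systemd ExecStart works just as well):

# /etc/rancher/k3s/config.yaml
disable-cloud-controller: true

# then restart the server
sudo systemctl restart k3s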

cc @brandond, who I saw mention this in #2477.

billimek commented 3 years ago

I'm observing the same issue with k3s v1.19.7+k3s1 (using the embedded SQLite datastore, FWIW) on an amd64 node (with 4 cores and 8GB memory). For completeness, posting a snippet of the logs here as well.

Like @ChipWolf notes, I also observe that after several iterations of startup & subsequent crashing, it eventually 'works' and stays running and stable until the next reboot.

Feb  9 18:17:23 k3s-0 k3s[5937]: F0209 18:17:23.032755    5937 controllermanager.go:262] Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server ("[+]ping ok\n[+]logok\n[+]etcd ok\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/max-in-flight-filter ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/crd-informer-synced ok\n[+]poststarthook/bootstrap-controller ok\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\n[+]poststarthook/start-cluster-authentication-info-controller ok\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[+]autoregister-completion ok\n[+]poststarthook/apiservice-openapi-controller ok\nhealthz check failed") has prevented the request from succeeding
brandond commented 3 years ago

@billimek can you attach a complete log from K3s startup onwards? I don't think I've seen this without there being some other issue with datastore latency.

billimek commented 3 years ago

@billimek can you attach a complete log from K3s startup onwards? I don't think I've seen this without there being some other issue with datastore latency.

Here we go: https://gist.github.com/billimek/064f85c0d5f83e85ab2feb7c7bcbf9a7

The interesting parts start around log timestamp Feb 09 18:15:45. I probably included too much preceding logging, but I wanted to give as much context as possible.

ChipWolf commented 3 years ago

The only correlation I've found (it could be purely coincidental) is that if a node normally has relatively high resource utilization, on reboot it will max out the CPU and crash loop.

Adding the following to my k3s server flags, and adding a few more nodes to further distribute the same workload, seems to keep the issue away for now: --kube-apiserver-arg default-not-ready-toleration-seconds=10 --kube-apiserver-arg default-unreachable-toleration-seconds=10.
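
Concretely, the full server invocation looks something like this (a sketch; any flags you already pass stay as-is, and the 10-second values are just the ones from above, not a recommendation):

k3s server \
  --kube-apiserver-arg default-not-ready-toleration-seconds=10 \
  --kube-apiserver-arg default-unreachable-toleration-seconds=10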

The caveat with my bandaid is that I'm running a significantly less efficient cluster with underutilized nodes.

billimek commented 3 years ago

FWIW, this is a summary view of CPU, load, and memory of my node post-reboot (indicated by the arrow) from the logs posted above:

[image: CPU, load, and memory graph around the reboot]

brandond commented 3 years ago

The cloud-controller-manager stuff has been reorganized a bit upstream as of 1.20 - has anyone seen this same issue on any of the 1.20 releases?

I will note that although cloud-controller-manager is what's ultimately blowing up, something is causing the apiserver's rbac role bootstrap (run on every startup to ensure that system-critical RBAC resources exist) to either fail or take too long to complete. Unfortunately, the reason for the failure is masked at the default log level. Would it be possible for someone experiencing this problem to start up K3s with --v=2 added to their k3s server flags?
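
For example, something like this (a sketch; adjust for however you launch k3s, and the output filename is just an illustration):

k3s server --v=2
# or append --v=2 to the ExecStart line of the systemd unit, then capture the log:
journalctl -u k3s --no-pager > k3s-v2.log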

clambin commented 3 years ago

k3s.log

Still getting this on k3s v1.20.4+k3s1, running on a Raspberry Pi 4 with 4GB. It consistently happens on server restart. If I give it long enough, it will eventually start up. This evening: 35 consecutive restarts and counting...

Log attached, with --v=2 added in /etc/systemd/system/k3s.service.

brandond commented 3 years ago

I continue to see the same thing: the core apiserver startup code that ensures all of the cluster-critical RBAC exists takes an excessively long time on hosts with slow CPU and disk IO. On your host, basic datastore operations are taking several seconds:

Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.357225    4562 trace.go:205] Trace[1362090208]: "Get" url:/apis/rbac.authorization.k8s.io/v1/clusterroles/cluster-admin,user-agent:k3s/v1.20.4+k3s1 (linux/arm) kubernetes/838a906,client:127.0.0.1 (22-Mar-2021 23:02:22.879) (total time: 1477ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[1362090208]: ---"About to write a response" 1476ms (23:02:00.356)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[1362090208]: [1.477146999s] [1.477146999s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.738373    4562 trace.go:205] Trace[2075772655]: "List etcd3" key:/secrets/media,resourceVersion:,resourceVersionMatch:,limit:0,continue: (22-Mar-2021 23:02:23.184) (total time: 1554ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[2075772655]: [1.554085554s] [1.554085554s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.784312    4562 trace.go:205] Trace[607658570]: "GuaranteedUpdate etcd3" type:*core.Event (22-Mar-2021 23:02:22.937) (total time: 1846ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[607658570]: ---"initial value restored" 1108ms (23:02:00.045)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[607658570]: ---"Transaction committed" 698ms (23:02:00.783)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[607658570]: [1.846476551s] [1.846476551s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.820537    4562 trace.go:205] Trace[132540766]: "Patch" url:/api/v1/namespaces/monitoring/events/monitoring-prometheus-node-exporter-t5fhn.166ec930c9804181,user-agent:k3s/v1.20.4+k3s1 (linux/arm) kubernetes/838a906,client:127.0.0.1 (22-Mar-2021 23:02:22.853) (total time: 1966ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: ---"Recorded the audit event" 83ms (23:02:00.937)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: ---"About to apply patch" 1108ms (23:02:00.045)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: ---"Object stored in database" 699ms (23:02:00.784)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[132540766]: [1.966605259s] [1.966605259s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: I0322 23:02:24.857252    4562 trace.go:205] Trace[360291281]: "List" url:/api/v1/namespaces/media/secrets,user-agent:helm-operator/v0.0.0 (linux/arm) kubernetes/$Format,client:192.168.0.10 (22-Mar-2021 23:02:23.183) (total time: 1673ms):
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[360291281]: ---"Listing from storage done" 1556ms (23:02:00.740)
Mar 22 23:02:24 raspberrypi1 k3s[4562]: Trace[360291281]: [1.673183598s] [1.673183598s] END
Mar 22 23:02:24 raspberrypi1 k3s[4562]: F0322 23:02:24.961119    4562 controllermanager.go:168] error building controller context: failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status:
clambin commented 3 years ago

Well, it is a Raspberry Pi. :-) But then it's been running fine for months; it only started having problems once I moved beyond v1.18. I'm confused that it's referring to etcd: I thought it was using SQLite as the internal DB?

brandond commented 3 years ago

@clambin can you try again with --kube-apiserver-arg=v=2 - it doesn't look like the log has anything useful in it that's specific to the rbac bootstrap health check failure.

As far as Kubernetes is concerned, Kine is an etcd server. It just happens to use a SQL backend.

clambin commented 3 years ago

here you go.

k3s.log

brandond commented 3 years ago

Hmm, seeing two poststarthooks failing to finish here:

I0325 12:45:54.111092    6845 healthz.go:244] poststarthook/start-apiextensions-controllers,poststarthook/crd-informer-synced,poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/scheduling/bootstrap-system-priority-classes,poststarthook/apiservice-registration-controller,autoregister-completion check failed: healthz
[-]poststarthook/start-apiextensions-controllers failed: not finished
[-]poststarthook/crd-informer-synced failed: not finished
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/scheduling/bootstrap-system-priority-classes failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:55.534660    6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:55.899645    6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:56.246184    6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:56.463343    6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:57.400340    6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:57.534746    6845 healthz.go:244] poststarthook/bootstrap-controller,poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/bootstrap-controller failed: not finished
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:58.521494    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:58.541441    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:59.526426    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:45:59.713831    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.038977    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.086097    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.086537    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.538786    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.592778    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: healthz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished
I0325 12:46:00.608981    6845 healthz.go:244] poststarthook/rbac/bootstrap-roles,poststarthook/apiservice-registration-controller check failed: readyz
[-]poststarthook/rbac/bootstrap-roles failed: not finished
[-]poststarthook/apiservice-registration-controller failed: not finished

Around the same time, datastore writes are taking multiple seconds to complete:

Trace[1089845604]: [4.141782501s] [4.141782501s] END
Trace[1362040333]: [8.087637975s] [8.087637975s] END
Trace[1065954533]: [3.626552762s] [3.626552762s] END
Trace[190730892]: [5.040242092s] [5.040242092s] END
Trace[1129193395]: [5.041031908s] [5.041031908s] END
Trace[393354043]: [7.635538056s] [7.635538056s] END
Trace[1429448011]: [7.636776825s] [7.636776825s] END
Trace[447841143]: [5.732564436s] [5.732564436s] END
Trace[95356746]: [1.687138997s] [1.687138997s] END
Trace[1113717490]: [1.772819012s] [1.772819012s] END
Trace[2043921204]: [5.817747555s] [5.817747555s] END
Trace[498634661]: [6.820461352s] [6.820461352s] END
Trace[1099537166]: [6.822579694s] [6.822579694s] END
Trace[888477776]: [5.291805666s] [5.291805666s] END
Trace[26378123]: [5.293546498s] [5.293546498s] END
Trace[2019969374]: [5.853455905s] [5.853455905s] END
Trace[478466535]: [5.854393978s] [5.854393978s] END
Trace[294992693]: [5.548645729s] [5.548645729s] END
Trace[1714041199]: [5.549684854s] [5.549684854s] END

I think newer versions of Kubernetes just try to do more on startup, and the additional load on the system causes startup to take longer than the time it's willing to give itself. You might look at testing the IO throughput of your SD card and picking up a faster one if possible. Switching to USB storage is another option.
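
For a rough number, something like the following gives a ballpark for sequential write throughput (a sketch only; the test file path is just an example, and a tool like fio would better reflect the small random writes the datastore actually does):

# crude sequential write test on the filesystem k3s uses
dd if=/dev/zero of=/var/lib/rancher/k3s/ddtest bs=1M count=256 oflag=direct status=progress
rm -f /var/lib/rancher/k3s/ddtest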

AlessioCasco commented 3 years ago

I have the same issue on a cluster of 3 AWS m5.large EC2 machines. It seems that v1.18.8+k3s1 is the latest version without the issue; I've also tried v1.18.13+k3s1, v1.19.5+k3s2, and v1.20.4+k3s1 without success.

brandond commented 3 years ago

@AlessioCasco I have not seen this issue present on anything except for very low-end arm hardware. Can you please open a new issue with information on your environment, and attach K3s logs from your nodes?

clambin commented 3 years ago

I noticed in #2903 (which has the same error messages) that there's a discussion on the speed of the SD card. Just for completeness: I switched the master node to booting off a USB-attached NVMe drive (not as fast as a PCIe-connected NVMe drive, but still faster than the SD card), and I still saw this error.

radhus commented 3 years ago

@brandond FYI, the comment I posted above was from a normal/fast Intel x86-64 machine, in case it matters that this has also been seen on non-low-end hardware.

clambin commented 3 years ago

One more data point: I only encounter this issue when restarting the (running) master k3s instance while the agent nodes have running pods. I just had another occurrence of this: I then restarted the agent k3s daemons, and the server came up immediately.

I've never seen this happen when the cluster is cold-started, only when the master k3s is restarted.

jeremyroberts0 commented 3 years ago

I had this same issue with k3s on a Pi 4; rolling back to v1.18 seemed to solve it.

Edit: v1.18 didn't solve it. It started on the first try but crashed later, which seems consistent with others' experience; I just got lucky on 1.18.

brandond commented 3 years ago

We have been unable to reproduce this locally, and since it's occurring somewhere in the core kubernetes apiserver startup code there may not be anything we can do about it other than opening an issue upstream.

jeremyroberts0 commented 3 years ago

I was able to solve the restart loop by adding the flag --disable-cloud-controller to ExecStart in /etc/systemd/system/k3s.service. This seems to be preserved across restarts and upgrades, and all my pods are running fine. I didn't create an external cloud controller, so for now it's running without one at all. I'm sure there's some negative implication here that I'm missing, but for now it seems to meet my needs. Hopefully this helps anyone else running into this problem.
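
For anyone else applying the same change, the edit looks roughly like this (a sketch of the stock unit; keep the rest of the ExecStart line exactly as the installer wrote it):

# /etc/systemd/system/k3s.service
ExecStart=/usr/local/bin/k3s \
    server \
    --disable-cloud-controller

sudo systemctl daemon-reload
sudo systemctl restart k3s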

projx commented 3 years ago

I've just had the same issue: the master suddenly went a bit haywire, spewing exceptions into the syslog. @jeremyroberts0, that workaround did the trick, thanks.

1998729 commented 3 years ago

So what is the solution? Same problem on version 1.23.

stale[bot] commented 2 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.