k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Add pprof capability to k3s server? #1635

Closed — TylerBrock closed this issue 2 years ago

TylerBrock commented 4 years ago

Hi all, I'm running k3s on x86_64 as well as armv7 (Raspberry Pi 4) and I've noticed there is consistent CPU usage after starting the service even though it should just be idle. One core always takes about 15-20% CPU with no pods of my own running.

Additionally, every ~5 seconds it uses ~25% of all four cores on both boxes to do something (most likely collecting metrics).

I'm not sure if this is a kubernetes problem in general but for something like k3s I'd like to figure out what is going on (since it's supposed to be lightweight).

Would anyone object to adding pprof to k3s to see exactly what it's doing with that constant CPU?
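
For context, a rough sketch of what using such an endpoint would look like, based on the flag and paths that were eventually validated later in this thread:

$ k3s server --enable-pprof
# then, from the same host, grab a 30s CPU profile from the supervisor port and inspect it:
$ curl --insecure https://localhost:6443/debug/pprof/profile > profile.pprof
$ go tool pprof profile.pprof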

Some additional details:

runningman84 commented 4 years ago

this is a similar issue: https://github.com/rancher/k3s/issues/294

TylerBrock commented 4 years ago

Yup, I read through that and it seems there was no outcome or solution, so I wanted to try to move things forward by seeing if anyone is against this. I'd be willing to add pprof support via a PR, but I want to know my time won't be wasted if folks are against it, or if they have profiled before and already know what the problem is.

runningman84 commented 4 years ago

I would be very happy to see the root cause for this issue...

@erikwilson what do you guys think about it?

MichaelKissling commented 4 years ago

Hello @TylerBrock

I am also very interested in the result. Currently we have the same problem: k3s-server uses about 20-30% CPU on a fresh Lubuntu 20.04 installation. It would be great to find and remove the cause.

regards Michael

stale[bot] commented 3 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

sysadmiral commented 3 years ago

Not stale. Still happens in 2021 running v1.21.3+k3s1 on a Pi cluster. No other pods running. The k3s process is constantly consuming about 20-30% CPU.

In my cluster I stopped k3s on the primary, and CPU usage on the agent node that had been showing the constant load dropped. When the primary came back up, the agent's CPU usage went back up.

[image: CPU usage graph on the agent node, showing the drop while the primary was stopped]
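
(For anyone wanting to reproduce this observation, a minimal sketch, assuming the default systemd units created by the install script:)

# on the primary/server node
$ sudo systemctl stop k3s
# on the agent node, watch the k3s-agent process CPU drop in top, then restart the server and watch it climb back
$ top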

pfisterer commented 2 years ago

I also see 10-13% consistently on a fresh install on Debian 11 w/o any workload.

brandond commented 2 years ago

Yes, that is to be expected. You might check out the expected resource figures at:

https://rancher.com/docs/k3s/latest/en/installation/installation-requirements/resource-profiling/

dereknola commented 2 years ago

PR https://github.com/k3s-io/k3s/pull/5527 was merged to address this.

rancher-max commented 2 years ago

Validated using master branch commit ID 2968a83bc046f43be62852b94d07c531de97c416

Environment Details:

Node(s) CPU architecture, OS, and Version:

Ubuntu 22.04 LTS
Validated both x86_64 and aarch64:
$ uname -a
Linux ip-172-31-47-67 5.15.0-1011-aws #14-Ubuntu SMP Wed Jun 1 20:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ uname -a
Linux ip-172-31-26-162 5.15.0-1009-aws #11-Ubuntu SMP Thu May 26 19:39:49 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration: 1 server
Also confirmed this flag does not apply to agents.

Config.yaml:

# configuration 1 - enable pprof:
write-kubeconfig-mode: 644
enable-pprof: true

# configuration 2 - default (pprof disabled):
write-kubeconfig-mode: 644
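
For reference, the same settings can be passed as CLI flags instead of a config.yaml, assuming k3s is launched directly (the --enable-pprof flag name matches the help text shown at the end of this report):

# equivalent to configuration 1
$ k3s server --write-kubeconfig-mode 644 --enable-pprof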

Validated with:

K3S Version/Commit: commit 2968a83bc046f43be62852b94d07c531de97c416

$ curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=2968a83bc046f43be62852b94d07c531de97c416 sh -

Steps To Validate:

  1. Set up config.yaml using configuration 1
  2. Installed k3s using the curl command
  3. Grabbed the profile by curling the correct endpoint: curl --insecure https://localhost:6443/debug/pprof/profile > profile.pprof
  4. Grabbed the trace output by curling the correct endpoint: curl --insecure https://localhost:6443/debug/pprof/trace > trace.out (other endpoints are noted after this list)
  5. Accessed the trace output
  6. Accessed the profile
  7. Performed the same steps 1-6 using configuration 2 -- pprof should be disabled.
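
In addition to the profile and trace endpoints used in steps 3 and 4 above, the standard net/http/pprof handler also serves heap, goroutine, and other profiles under the same prefix; assuming k3s registers the full standard set, they can be grabbed the same way, e.g.:

$ curl --insecure https://localhost:6443/debug/pprof/heap > heap.pprof
$ curl --insecure https://localhost:6443/debug/pprof/goroutine > goroutine.pprof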

Results:

  1. When accessing the trace output (step 5 above): note that I was unable to open the trace viewer itself, since I don't have a browser available in my VM, so I'm not 100% sure this part is working as expected, but the output looks like what I'd expect, so I am considering it okay.
    $ go tool trace trace.out 
    2022/06/16 18:18:08 Parsing trace...
    2022/06/16 18:18:08 Splitting trace...
    2022/06/16 18:18:08 Opening browser. Trace viewer is listening on http://127.0.0.1:40013
  2. When accessing the profile (step 6 above): some pprof commands may fail because the tools they rely on are not installed on the host. This is intentional; such utilities should be installed by the operator as needed.
    
    $ go tool pprof profile.pprof 
    File: k3s
    Type: cpu
    Time: Jun 16, 2022 at 6:08pm (UTC)
    Duration: 30s, Total samples = 1.69s ( 5.63%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) 

(pprof) top
Showing nodes accounting for 860ms, 50.89% of 1690ms total
Showing top 10 nodes out of 511
      flat  flat%   sum%        cum   cum%
     160ms  9.47%  9.47%      160ms  9.47%  runtime.futex
     130ms  7.69% 17.16%      130ms  7.69%  syscall.Syscall6
     100ms  5.92% 23.08%      100ms  5.92%  runtime.epollwait
      80ms  4.73% 27.81%      210ms 12.43%  github.com/prometheus/client_golang/prometheus.(*MetricVec).hashLabelValues
      80ms  4.73% 32.54%       80ms  4.73%  syscall.Syscall
      70ms  4.14% 36.69%      190ms 11.24%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
      70ms  4.14% 40.83%       70ms  4.14%  github.com/prometheus/client_golang/prometheus.(*histogram).observe (inline)
      70ms  4.14% 44.97%       70ms  4.14%  runtime.write1
      50ms  2.96% 47.93%       50ms  2.96%  github.com/prometheus/client_golang/prometheus.hashAdd
      50ms  2.96% 50.89%       60ms  3.55%  runtime.cgocall

(pprof) png
failed to execute dot. Is Graphviz installed? Error: exec: "dot": executable file not found in $PATH

Exited pprof and installed Graphviz. For example, on Ubuntu:

$ sudo apt update && sudo apt install graphviz

Went back into pprof and this worked:

(pprof) png
Generating report in profile001.png

GIF output from x86:
![profile001](https://user-images.githubusercontent.com/55997940/174147453-b0b4fe17-5344-4546-8526-4e04cf984fd9.gif)

PNG output from aarch64:
![profile001](https://user-images.githubusercontent.com/55997940/174147462-0a9ad76d-5d3b-431d-8057-be6c018c34f3.png)
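
As an aside, if installing Graphviz is not an option, pprof's text reports (like the top output above) do not need dot, for example:

$ go tool pprof -top profile.pprof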

3. When pprof is disabled (configuration 2), none of the outputs can be accessed, as expected:

$ go tool pprof profile.pprof
profile.pprof: parsing profile: unrecognized profile format
failed to fetch any source profiles


**Additional context / logs:**
1. When trying pprof on the endpoint directly, it fails as expected:

$ go tool pprof https://localhost:6443/debug/pprof/profile
Fetching profile over HTTP from https://localhost:6443/debug/pprof/profile
https://localhost:6443/debug/pprof/profile: Get "https://localhost:6443/debug/pprof/profile": x509: certificate signed by unknown authority
failed to fetch any source profiles
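
A possible workaround sketch: point the client at the k3s server CA instead of skipping verification; the CA path below is the usual default location and is an assumption, not something verified as part of this validation:

# assumes the default k3s data dir
$ curl --cacert /var/lib/rancher/k3s/server/tls/server-ca.crt https://localhost:6443/debug/pprof/profile > profile.pprof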

2. The other endpoints have different outputs and do not give pprof data:

$ curl --insecure https://localhost:6443/debug/pprof/cmdline > cmdline.pprof
$ go tool pprof cmdline.pprof
cmdline.pprof: parsing profile: unrecognized profile format
failed to fetch any source profiles

3. Help text is sufficient:

$ k3s server --help | grep -i pprof
   --enable-pprof                             (experimental) Enable pprof endpoint on supervisor port