google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Reducing cAdvisor memory usage #2045

Open SamSaffron opened 5 years ago

SamSaffron commented 5 years ago

We have noticed that resource usage (both CPU and memory) is less than ideal on cadvisor.

In particular, on a machine with 40 or so containers, we see cadvisor ramp up to 100MB RSS fairly quickly. CPU usage is also fairly high.

From memory dumps I isolated that about 50% of memory usage is due to container disk IO stats:

(pprof) top50
Showing nodes accounting for 34021.71kB, 100% of 34021.71kB total
Showing top 50 nodes out of 73
      flat  flat%   sum%        cum   cum%
10754.13kB 31.61% 31.61% 14850.68kB 43.65%  github.com/google/cadvisor/container/libcontainer.DiskStatsCopy
 7680.15kB 22.57% 54.18%  7680.15kB 22.57%  bufio.(*Scanner).Text (inline)
 3074.62kB  9.04% 63.22% 17925.30kB 52.69%  github.com/google/cadvisor/container/libcontainer.newContainerStats
 2048.45kB  6.02% 69.24%  2048.45kB  6.02%  github.com/google/cadvisor/container/libcontainer.DiskStatsCopy1
 2048.09kB  6.02% 75.26%  2048.09kB  6.02%  github.com/google/cadvisor/container/libcontainer.DiskStatsCopy0 (inline)
 1536.23kB  4.52% 79.78%  1536.23kB  4.52%  github.com/google/cadvisor/vendor/github.com/aws/aws-sdk-go/aws/endpoints.init
 1184.27kB  3.48% 83.26%  1184.27kB  3.48%  bytes.makeSlice
 1030.64kB  3.03% 86.29%  1030.64kB  3.03%  regexp/syntax.(*compiler).inst (inline)
  561.50kB  1.65% 87.94%   561.50kB  1.65%  html.init
  516.01kB  1.52% 89.45%   516.01kB  1.52%  github.com/google/cadvisor/vendor/golang.org/x/net/trace.init
  514.63kB  1.51% 90.97%   514.63kB  1.51%  math/rand.NewSource
  512.56kB  1.51% 92.47%   512.56kB  1.51%  compress/flate.newHuffmanEncoder (inline)
  512.31kB  1.51% 93.98%  1073.81kB  3.16%  html/template.init
  512.05kB  1.51% 95.49%   512.05kB  1.51%  github.com/google/cadvisor/vendor/golang.org/x/exp/inotify.(*Watcher).readEvents
  512.02kB  1.50% 96.99%   512.02kB  1.50%  github.com/google/cadvisor/summary.(*StatsSummary).AddSample
  512.02kB  1.50% 98.50%   512.02kB  1.50%  github.com/google/cadvisor/vendor/golang.org/x/net/http2/hpack.addDecoderNode
  512.02kB  1.50%   100%   512.02kB  1.50%  vendor/golang_org/x/net/http2/hpack.addDecoderNode
         0     0%   100%  1184.27kB  3.48%  bytes.(*Buffer).ReadFrom
         0     0%   100%  1184.27kB  3.48%  bytes.(*Buffer).grow
         0     0%   100%   512.56kB  1.51%  compress/flate.generateFixedLiteralEncoding
         0     0%   100%   512.56kB  1.51%  compress/flate.init
         0     0%   100%   512.56kB  1.51%  compress/gzip.init
         0     0%   100%   514.63kB  1.51%  crypto/rsa.init
         0     0%   100%   514.63kB  1.51%  crypto/tls.init
         0     0%   100%   514.63kB  1.51%  crypto/x509.init
         0     0%   100%  3077.57kB  9.05%  github.com/google/cadvisor/api.init
         0     0%   100%  1028.03kB  3.02%  github.com/google/cadvisor/container/containerd.init
         0     0%   100%  2049.54kB  6.02%  github.com/google/cadvisor/container/docker.init
         0     0%   100% 25605.45kB 75.26%  github.com/google/cadvisor/container/libcontainer.(*Handler).GetStats
         0     0%   100% 14850.68kB 43.65%  github.com/google/cadvisor/container/libcontainer.setDiskIoStats
         0     0%   100% 25605.45kB 75.26%  github.com/google/cadvisor/container/raw.(*rawContainerHandler).GetStats
         0     0%   100%  4261.84kB 12.53%  github.com/google/cadvisor/http.init
         0     0%   100%  1536.23kB  4.52%  github.com/google/cadvisor/machine.init
         0     0%   100% 26117.48kB 76.77%  github.com/google/cadvisor/manager.(*containerData).housekeeping
         0     0%   100% 26117.48kB 76.77%  github.com/google/cadvisor/manager.(*containerData).housekeepingTick
         0     0%   100% 26117.48kB 76.77%  github.com/google/cadvisor/manager.(*containerData).updateStats
         0     0%   100%  3077.57kB  9.05%  github.com/google/cadvisor/manager.init
         0     0%   100%  1184.27kB  3.48%  github.com/google/cadvisor/pages/static.Asset
         0     0%   100%  1184.27kB  3.48%  github.com/google/cadvisor/pages/static.bindataRead
         0     0%   100%  1184.27kB  3.48%  github.com/google/cadvisor/pages/static.init
         0     0%   100%  1184.27kB  3.48%  github.com/google/cadvisor/pages/static.pagesAssetsJsGchartsJs
         0     0%   100%  1184.27kB  3.48%  github.com/google/cadvisor/pages/static.pagesAssetsJsGchartsJsBytes
         0     0%   100%  1536.23kB  4.52%  github.com/google/cadvisor/utils/cloudinfo.init
         0     0%   100%  1536.23kB  4.52%  github.com/google/cadvisor/vendor/github.com/aws/aws-sdk-go/aws.init
         0     0%   100%  1028.03kB  3.02%  github.com/google/cadvisor/vendor/github.com/containerd/containerd/api/services/containers/v1.init
         0     0%   100%   513.31kB  1.51%  github.com/google/cadvisor/vendor/github.com/docker/distribution/reference.anchored
         0     0%   100%   513.31kB  1.51%  github.com/google/cadvisor/vendor/github.com/docker/distribution/reference.init
         0     0%   100%   513.31kB  1.51%  github.com/google/cadvisor/vendor/github.com/docker/docker/client.init
         0     0%   100%  7680.15kB 22.57%  github.com/google/cadvisor/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs.(*BlkioGroup).GetStats
         0     0%   100%  7680.15kB 22.57%  github.com/google/cadvisor/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs.(*Manager).GetStats

By commenting out the disk IO stats collection, I can cut memory to around 50% of its previous level and reduce CPU by more than 50%:

diff --git a/cadvisor.go b/cadvisor.go
index 5336cb4b..8ae302c4 100644
--- a/cadvisor.go
+++ b/cadvisor.go
@@ -244,7 +244,7 @@ func toIncludedMetrics(ignoreMetrics container.MetricSet) container.MetricSet {
                container.PerCpuUsageMetrics,
                container.MemoryUsageMetrics,
                container.CpuLoadMetrics,
-               container.DiskIOMetrics,
+               // container.DiskIOMetrics,
                container.DiskUsageMetrics,
                container.NetworkUsageMetrics,
                container.NetworkTcpUsageMetrics,
diff --git a/container/libcontainer/handler.go b/container/libcontainer/handler.go
index 18c465f2..d9ad505e 100644
--- a/container/libcontainer/handler.go
+++ b/container/libcontainer/handler.go
@@ -484,14 +484,14 @@ func getNumberOnlineCPUs() (uint32, error) {
 }

 func setDiskIoStats(s *cgroups.Stats, ret *info.ContainerStats) {
-       ret.DiskIo.IoServiceBytes = DiskStatsCopy(s.BlkioStats.IoServiceBytesRecursive)
-       ret.DiskIo.IoServiced = DiskStatsCopy(s.BlkioStats.IoServicedRecursive)
-       ret.DiskIo.IoQueued = DiskStatsCopy(s.BlkioStats.IoQueuedRecursive)
-       ret.DiskIo.Sectors = DiskStatsCopy(s.BlkioStats.SectorsRecursive)
-       ret.DiskIo.IoServiceTime = DiskStatsCopy(s.BlkioStats.IoServiceTimeRecursive)
-       ret.DiskIo.IoWaitTime = DiskStatsCopy(s.BlkioStats.IoWaitTimeRecursive)
-       ret.DiskIo.IoMerged = DiskStatsCopy(s.BlkioStats.IoMergedRecursive)
-       ret.DiskIo.IoTime = DiskStatsCopy(s.BlkioStats.IoTimeRecursive)
+       // ret.DiskIo.IoServiceBytes = DiskStatsCopy(s.BlkioStats.IoServiceBytesRecursive)
+       // ret.DiskIo.IoServiced = DiskStatsCopy(s.BlkioStats.IoServicedRecursive)
+       // ret.DiskIo.IoQueued = DiskStatsCopy(s.BlkioStats.IoQueuedRecursive)
+       // ret.DiskIo.Sectors = DiskStatsCopy(s.BlkioStats.SectorsRecursive)
+       // ret.DiskIo.IoServiceTime = DiskStatsCopy(s.BlkioStats.IoServiceTimeRecursive)
+       // ret.DiskIo.IoWaitTime = DiskStatsCopy(s.BlkioStats.IoWaitTimeRecursive)
+       // ret.DiskIo.IoMerged = DiskStatsCopy(s.BlkioStats.IoMergedRecursive)
+       // ret.DiskIo.IoTime = DiskStatsCopy(s.BlkioStats.IoTimeRecursive)
 }

 func setMemoryStats(s *cgroups.Stats, ret *info.ContainerStats) {

After this is commented out I am still left with:

Showing nodes accounting for 5795.77kB, 100% of 5795.77kB total
      flat  flat%   sum%        cum   cum%
 1537.31kB 26.52% 26.52%  1537.31kB 26.52%  github.com/google/cadvisor/container/libcontainer.newContainerStats
 1184.27kB 20.43% 46.96%  1184.27kB 20.43%  bytes.makeSlice
 1024.08kB 17.67% 64.63%  3073.45kB 53.03%  github.com/google/cadvisor/container/libcontainer.(*Handler).GetStats
  513.31kB  8.86% 73.48%   513.31kB  8.86%  regexp/syntax.(*compiler).inst (inline)
  512.69kB  8.85% 82.33%   512.69kB  8.85%  github.com/google/cadvisor/container/raw.(*rawContainerHandler).getFsStats
  512.06kB  8.84% 91.17%   512.06kB  8.84%  strings.Replace
  512.05kB  8.83%   100%   512.05kB  8.83%  path.(*lazybuf).string (inline)
         0     0%   100%  1184.27kB 20.43%  bytes.(*Buffer).ReadFrom
         0     0%   100%  1184.27kB 20.43%  bytes.(*Buffer).grow
         0     0%   100%   513.31kB  8.86%  github.com/google/cadvisor/api.init
         0     0%   100%   512.05kB  8.83%  github.com/google/cadvisor/container.NewContainerHandler
         0     0%   100%  3073.45kB 53.03%  github.com/google/cadvisor/container/docker.(*dockerContainerHandler).GetStats
         0     0%   100%   512.05kB  8.83%  github.com/google/cadvisor/container/docker.(*dockerFactory).NewContainerHandler
         0     0%   100%   513.31kB  8.86%  github.com/google/cadvisor/container/docker.init
         0     0%   100%   512.05kB  8.83%  github.com/google/cadvisor/container/docker.newDockerContainerHandler
         0     0%   100%   512.06kB  8.84%  github.com/google/cadvisor/container/libcontainer.networkStatsFromProc
         0     0%   100%   512.06kB  8.84%  github.com/google/cadvisor/container/libcontainer.scanInterfaceStats
         0     0%   100%   512.69kB  8.85%  github.com/google/cadvisor/container/raw.(*rawContainerHandler).GetStats
         0     0%   100%  1697.59kB 29.29%  github.com/google/cadvisor/http.init
         0     0%   100%  3586.14kB 61.88%  github.com/google/cadvisor/manager.(*containerData).housekeeping
         0     0%   100%  3586.14kB 61.88%  github.com/google/cadvisor/manager.(*containerData).housekeepingTick
         0     0%   100%  3586.14kB 61.88%  github.com/google/cadvisor/manager.(*containerData).updateStats
         0     0%   100%   512.05kB  8.83%  github.com/google/cadvisor/manager.(*manager).Start
         0     0%   100%   512.05kB  8.83%  github.com/google/cadvisor/manager.(*manager).createContainer
         0     0%   100%   512.05kB  8.83%  github.com/google/cadvisor/manager.(*manager).createContainerLocked
         0     0%   100%   512.05kB  8.83%  github.com/google/cadvisor/manager.(*manager).detectSubcontainers
         0     0%   100%   513.31kB  8.86%  github.com/google/cadvisor/manager.init
         0     0%   100%  1184.27kB 20.43%  github.com/google/cadvisor/pages/static.Asset
         0     0%   100%  1184.27kB 20.43%  github.com/google/cadvisor/pages/static.bindataRead
         0     0%   100%  1184.27kB 20.43%  github.com/google/cadvisor/pages/static.init
         0     0%   100%  1184.27kB 20.43%  github.com/google/cadvisor/pages/static.pagesAssetsJsGchartsJs
         0     0%   100%  1184.27kB 20.43%  github.com/google/cadvisor/pages/static.pagesAssetsJsGchartsJsBytes
         0     0%   100%   513.31kB  8.86%  github.com/google/cadvisor/vendor/github.com/docker/distribution/reference.anchored
         0     0%   100%   513.31kB  8.86%  github.com/google/cadvisor/vendor/github.com/docker/distribution/reference.init
         0     0%   100%   513.31kB  8.86%  github.com/google/cadvisor/vendor/github.com/docker/docker/client.init
         0     0%   100%  1184.27kB 20.43%  io.Copy
         0     0%   100%  1184.27kB 20.43%  io.copyBuffer
         0     0%   100%  1697.59kB 29.29%  main.init
         0     0%   100%   512.05kB  8.83%  main.main
         0     0%   100%   512.05kB  8.83%  path.Clean
         0     0%   100%   512.05kB  8.83%  path.Join
         0     0%   100%   513.31kB  8.86%  regexp.Compile
         0     0%   100%   513.31kB  8.86%  regexp.MustCompile
         0     0%   100%   513.31kB  8.86%  regexp.compile
         0     0%   100%   513.31kB  8.86%  regexp/syntax.(*compiler).compile
         0     0%   100%   513.31kB  8.86%  regexp/syntax.(*compiler).rune
         0     0%   100%   513.31kB  8.86%  regexp/syntax.Compile
         0     0%   100%  2209.63kB 38.12%  runtime.main

This is not too unreasonable. However, RSS for the same process is 50MB, so my guess is that the majority of the memory is just loaded libraries rather than actual retained data.
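
One way to probe the gap between the heap profile and RSS is to compare the Go runtime's own accounting. The sketch below is not from cAdvisor. It reads `runtime.MemStats` and prints heap usage next to the total memory the runtime has obtained from the OS. Note that `Sys` measures reserved virtual address space, so it is only a rough proxy: RSS additionally depends on which pages are resident and includes the binary's own mapped pages. The `memSummary` type and `summarize` function are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
)

// memSummary holds a few runtime.MemStats fields, converted to kB.
type memSummary struct {
	HeapAllocKB uint64 // allocated heap objects (live plus not yet swept)
	HeapSysKB   uint64 // heap memory obtained from the OS
	SysKB       uint64 // all memory obtained from the OS by the Go runtime
}

// summarize snapshots the runtime memory statistics.
func summarize() memSummary {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memSummary{
		HeapAllocKB: m.HeapAlloc / 1024,
		HeapSysKB:   m.HeapSys / 1024,
		SysKB:       m.Sys / 1024,
	}
}

func main() {
	s := summarize()
	fmt.Printf("heap alloc: %d kB\n", s.HeapAllocKB)
	fmt.Printf("heap sys:   %d kB\n", s.HeapSysKB)
	fmt.Printf("total sys:  %d kB\n", s.SysKB)
}
```

If `HeapAlloc` is small while `Sys` or the OS-reported RSS is large, the difference lies outside live heap objects (runtime overhead, stacks, freed-but-unreturned heap, or mapped binary pages), which would be consistent with the guess above.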

This makes me wonder a few things

SamSaffron commented 5 years ago

I did realize something... should we not be de-duplicating this string?

https://github.com//blob/b36e6fb63ac7099a420139a53a411d70c9cc2553/container/libcontainer/helpers.go#L133-L136
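
As a rough illustration of what de-duplicating could look like, here is a minimal string-interning sketch. It assumes the string in question takes only a small set of repeated values (for example, operation labels parsed from cgroup files), which is a guess and not something the linked lines confirm. The `interner` type is hypothetical and is not part of cAdvisor.

```go
package main

import (
	"fmt"
	"strings"
)

// interner is a hypothetical helper that returns one canonical copy of each
// distinct string, so repeated values share a single allocation.
// It is not safe for concurrent use without external locking.
type interner struct {
	seen map[string]string
}

func newInterner() *interner {
	return &interner{seen: make(map[string]string)}
}

// intern returns the canonical copy of s, storing a detached clone the
// first time a value is seen so no larger backing buffer is kept alive.
func (in *interner) intern(s string) string {
	if c, ok := in.seen[s]; ok {
		return c
	}
	c := strings.Clone(s)
	in.seen[c] = c
	return c
}

func main() {
	in := newInterner()
	// Simulated input: the same few labels repeated many times.
	labels := []string{"Read", "Write", "Read", "Sync", "Write", "Read"}
	for _, l := range labels {
		_ = in.intern(l)
	}
	fmt.Println("distinct strings retained:", len(in.seen))
}
```

The trade-off is an extra map lookup per string in exchange for retaining one copy per distinct value rather than one per occurrence. Whether that pays off depends on how many distinct values actually occur.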

dashpole commented 5 years ago

@SamSaffron thanks a bunch for this. It is quite helpful. I agree that we should reduce the set of metrics provided by the kubelet. You can see an overview of the roadmap here: https://github.com/kubernetes/kubernetes/issues/68522.

Does it make sense to ship a "minimal" build of cadvisor with "docker only" so people only using it to monitor containers don't need to load up mesos/rkt/systemd/crio/aws and so on?

We should make use of https://github.com/google/cadvisor/pull/1926 in kubernetes to ignore mesos/systemd/aws, etc containers. We should also introduce an option to only collect raw cgroups (no docker, CRI-O, etc) for runtimes that provide container metrics via CRI.

Do we want a flag for -disable_metrics diskio, given that we have one for disk now?

I would be happy to review a PR that adds diskIO as a metric that can be ignored.
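
A minimal sketch of the idea, assuming collection is gated on a set of included metrics: the types below are hypothetical stand-ins modelled on the `container.MetricSet` and `container.DiskIOMetrics` names visible in the diff above, not the real cAdvisor definitions, and `collect` is an invented function.

```go
package main

import "fmt"

// MetricKind and MetricSet are hypothetical stand-ins, not cAdvisor's types.
type MetricKind string

const (
	DiskIOMetrics    MetricKind = "diskIO"
	DiskUsageMetrics MetricKind = "disk"
	MemoryMetrics    MetricKind = "memory"
)

type MetricSet map[MetricKind]struct{}

// Has reports whether the set contains kind k.
func (ms MetricSet) Has(k MetricKind) bool {
	_, ok := ms[k]
	return ok
}

// difference returns the metrics in all that are not in ignored.
func difference(all, ignored MetricSet) MetricSet {
	out := MetricSet{}
	for k := range all {
		if !ignored.Has(k) {
			out[k] = struct{}{}
		}
	}
	return out
}

// stats is a toy result type; DiskIo stays nil when disk IO is ignored.
type stats struct {
	DiskIo []uint64
}

// collect copies the raw disk IO counters only when the metric is included,
// so an ignored metric costs no allocation.
func collect(included MetricSet, rawDiskIo []uint64) stats {
	var s stats
	if included.Has(DiskIOMetrics) {
		s.DiskIo = append([]uint64(nil), rawDiskIo...)
	}
	return s
}

func main() {
	all := MetricSet{DiskIOMetrics: {}, DiskUsageMetrics: {}, MemoryMetrics: {}}
	ignored := MetricSet{DiskIOMetrics: {}}
	included := difference(all, ignored)

	s := collect(included, []uint64{1, 2, 3})
	fmt.Println("disk IO collected:", s.DiskIo != nil)
}
```

Skipping the copy at collection time, rather than filtering at export time, is what would avoid the `DiskStatsCopy` allocations seen in the first profile.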

The rough plan is for the kubelet to disable all metrics other than CPU/Memory/Disk starting in 1.15 to allow for a deprecation window.

port19x commented 1 year ago

If you're just using cadvisor for prometheus exporting, consider a storage duration of ~2x the scrape interval. Helped me save a few megs.

nvm, restart placebo