google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

The relationship between huge CPU usage and options #2144

Open k0nstantinv opened 5 years ago

k0nstantinv commented 5 years ago

Hi, I have been using cAdvisor as a DaemonSet in a Kubernetes cluster. The cluster consists of 100 nodes, each of which runs ~80 pods, and each node is a high-performance bare-metal server. The DaemonSet is deployed without any limits (I mean k8s limits); here it is:

apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
      name: cadvisor
    spec:
      tolerations:
      - operator: Exists
      containers:
      - name: cadvisor
        args:
          - --disable_metrics=disk,udp,percpu # enable only tcp
          - --docker_only
          - --housekeeping_interval=30s
          - --max_housekeeping_interval=60s
        image: k8s.gcr.io/cadvisor:v0.30.2
        resources:
          requests:
            cpu: 1000m
            memory: 512Mi
        volumeMounts:
        - name: var-run
          mountPath: /var/run
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        ports:
          - name: http
            containerPort: 8080
            protocol: TCP
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 30
      volumes:
      - name: var-run
        hostPath:
          path: /var/run
      - name: docker
        hostPath:
          path: /var/lib/docker
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /

As you can see, I have already set the --disable_metrics argument on the cAdvisor container. The only thing I need from cAdvisor is the Prometheus metric container_network_tcp_usage_total, so everything except TCP has been disabled. (Disabling metrics doesn't exclude them from the /metrics endpoint, nor does it cause them to be zero every time, by the way.)

Before the main questions, I'd like to point out the following:

- the --ignore_metrics option doesn't exist in my version
- the --disable_metrics option has only: 'disk', 'network', 'tcp', 'udp', 'percpu'
- --housekeeping_interval is already 30s
- --docker_only didn't help

So, I completely don't understand:

PS: why does the --disable_metrics option have network and tcp/udp at the same time? I assumed network was tcp+udp, but the metric container_network_tcp_usage_total is always zero unless network is enabled.
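For now I drop the unneeded series on the Prometheus side at scrape time. A sketch of what I mean, using metric_relabel_configs (the job name and target below are placeholders, not my real config):

```yaml
scrape_configs:
  - job_name: cadvisor            # placeholder job name
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # Keep only the TCP usage metric; drop every other series.
      - source_labels: [__name__]
        regex: container_network_tcp_usage_total
        action: keep
```

This hides the unwanted series from Prometheus, but cAdvisor still pays the cost of collecting and exposing them.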

dashpole commented 5 years ago

Disabling metrics doesn't exclude them from the /metrics endpoint, nor does it cause them to be zero every time, by the way.

That is really odd. Which metrics does this happen for?

Your problem seems similar to https://github.com/google/cadvisor/issues/1774, as it manifests as inexplicably high CPU usage, but only on some machines.

now I must drop unnecessary metrics in Prometheus, and it was the best bad idea I could come up with. What is the best way to completely disable everything I don't need on cAdvisor's side (including the /metrics endpoint)?

I added https://github.com/google/cadvisor/pull/1980 just after the version you are using was cut. Try bumping the version to v0.31.0 or higher.

PS: why does the --disable_metrics option have network and tcp/udp at the same time?

I agree the naming is confusing. They are meant to be non-overlapping sets of metrics, as tcp/udp create an enormous number of additional metric streams compared with the basic network metrics.
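To put rough numbers on "enormous": container_network_tcp_usage_total carries a tcp_state label, so every container contributes one stream per TCP state. A back-of-the-envelope sketch, using the ~80 pods per node mentioned above (the count of ~11 tcp_state label values, e.g. established, listen, time_wait, is my assumption):

```shell
# Streams added per node by the tcp metric alone.
# 80 pods/node comes from the report above; ~11 tcp_state
# label values is an assumption about the label set.
pods_per_node=80
tcp_states=11
streams=$((pods_per_node * tcp_states))
echo "$streams tcp metric streams per node"
```

That is hundreds of extra streams per node for a single metric family, before udp is even considered.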

I assumed network was tcp+udp, but the metric container_network_tcp_usage_total is always zero unless network is enabled

That sounds like a bug.

k0nstantinv commented 5 years ago

@dashpole thanks. I first tried the latest tag, but it was outdated, so I decided to use v0.30.2, the same way as in https://github.com/google/cadvisor/blob/master/deploy/kubernetes/base/daemonset.yaml. I'm not sure which tag I should test now. Can you advise?

dashpole commented 5 years ago

Yeah, I need to update that. I would try the new latest (as of yesterday), v0.32.0.

k0nstantinv commented 5 years ago

It seems like disabling network really causes the tcp metrics to be zero even if I use the v0.32.0 tag. I have 4 running containers on localhost:

CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                    NAMES
5f820c33c9ea        google/cadvisor:v0.32.0   "/usr/bin/cadvisor -…"   2 minutes ago       Up 2 minutes        0.0.0.0:8080->8080/tcp   cadvisor
40e0315e658f        debian:wheezy             "bash"                   4 weeks ago         Up 47 seconds                                puppet-agent-2
cabe397f89ab        devopsil/puppet           "bash"                   4 weeks ago         Up About a minute                            puppet-agent
a37616aab79a        devopsil/puppet           "bash"                   4 weeks ago         Up 29 seconds                                puppet-master 

As you can see, I have a Puppet master with 2 registered agents, and they are configured correctly:

$ docker exec puppet-agent puppet agent -t
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Caching catalog for puppet-agent.my.local
Info: Applying configuration version '1547453764'
Notice: Finished catalog run in 0.02 seconds

$ docker exec puppet-agent-2 puppet agent -t
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Caching catalog for puppet-agent-2.my.local
Info: Applying configuration version '1547453764'
Notice: Finished catalog run in 0.01 seconds

[root@puppet-master /]# netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 0.0.0.0:8140                0.0.0.0:*                   LISTEN      -
tcp        0      0 127.0.0.11:42191            0.0.0.0:*                   LISTEN      -

If cAdvisor is run with:

$ docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:v0.32.0 --disable_metrics=udp,network --docker_only

then:

container_network_tcp_usage_total{container_label_build_date="20180402",container_label_license="GPLv2",container_label_name="CentOS Base Image",container_label_vendor="CentOS",id="/docker/a37616aab79a1374046aada7bb69d0c6ed41c63953098bc450dae7166868c5ec",image="devopsil/puppet",name="puppet-master",tcp_state="listen"} 0

If cAdvisor is run with:

$ docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:v0.32.0 --disable_metrics=udp --docker_only

then it starts to show values:

container_network_tcp_usage_total{container_label_build_date="20180402",container_label_license="GPLv2",container_label_name="CentOS Base Image",container_label_vendor="CentOS",id="/docker/a37616aab79a1374046aada7bb69d0c6ed41c63953098bc450dae7166868c5ec",image="devopsil/puppet",name="puppet-master",tcp_state="listen"} 2

Then, disabling all the metrics except tcp via:

--disable_metrics=network,udp,percpu,sched,process --docker_only

makes the /metrics endpoint return this list of metrics:

container_cpu_load_average_10s
container_cpu_system_seconds_total
container_cpu_usage_seconds_total
container_cpu_user_seconds_total
container_fs_inodes_free
container_fs_inodes_total
container_fs_io_current
container_fs_io_time_seconds_total
container_fs_io_time_weighted_seconds_total
container_fs_limit_bytes
container_fs_read_seconds_total
container_fs_reads_bytes_total
container_fs_reads_merged_total
container_fs_reads_total
container_fs_sector_reads_total
container_fs_sector_writes_total
container_fs_usage_bytes
container_fs_write_seconds_total
container_fs_writes_bytes_total
container_fs_writes_merged_total
container_fs_writes_total
container_last_seen
container_memory_cache
container_memory_failcnt
container_memory_failures_total
container_memory_mapped_file
container_memory_max_usage_bytes
container_memory_rss
container_memory_swap
container_memory_usage_bytes
container_memory_working_set_bytes
container_network_tcp_usage_total
container_scrape_error
container_spec_cpu_period
container_spec_cpu_shares
container_spec_memory_limit_bytes
container_spec_memory_reservation_limit_bytes
container_spec_memory_swap_limit_bytes
container_start_time_seconds
container_tasks_state

Well... with the current version I no longer see always-zero values, and the disabled metrics really disappeared from the endpoint, but it still exposes a group of metrics I don't need. The --disable_metrics option does not provide a mechanism to exclude the above metrics from the endpoint. You certainly know better than I do, but I think this is incorrect.

dashpole commented 5 years ago

Someone is working on the container_fs metrics: https://github.com/google/cadvisor/pull/2103

No one has ever requested disabling cpu/memory before, but we could add it.

It seems like disabling network really causes the tcp metrics to be zero even if I use the v0.32.0 tag.

Yes, this is a bug. I'll look into it sometime...

sevagh commented 4 years ago

No one has ever requested disabling cpu/memory before, but we could add it.

I would like this as well. We get plenty of memory and CPU stats from Nomad - I'm looking to use cadvisor only for network stats.

edit: my other problems were covered here

eero-t commented 3 years ago

IMHO it would be better to have an option to enable specific metrics, in addition to the option for disabling them. The logic would be: if --enable_metrics has a non-empty set, it overrides the --disable_metrics set.

(If that seems reasonable, I could create a PR implementing it.)
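A toy sketch of the intended precedence (shell, just to illustrate the rule; this is not cAdvisor code, and the variable names are made up):

```shell
# Toy illustration of the proposed precedence:
# a non-empty enable set wins over the disable set.
enable_metrics="tcp"
disable_metrics="disk,udp,percpu,network"

if [ -n "$enable_metrics" ]; then
  echo "collecting only: $enable_metrics"
else
  echo "collecting all except: $disable_metrics"
fi
```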