k0nstantinv opened 5 years ago
Disabling metrics doesn't exclude them from the /metrics endpoint, nor does it cause them to always be zero, by the way
That is really odd. Which metrics does this happen for?
Your problem seems similar to https://github.com/google/cadvisor/issues/1774, as it manifests as inexplicably high CPU usage, but only on some machines.
For now I have to drop the unnecessary metrics on the Prometheus side, which was the best bad idea I could come up with. What is the best way to completely disable everything I don't need on cAdvisor's side (including the /metrics endpoint)?
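For context, the Prometheus-side drop can be expressed with metric_relabel_configs; a minimal sketch, assuming a scrape job named cadvisor on port 8080 (both names illustrative):

```yaml
scrape_configs:
  - job_name: cadvisor                # illustrative job name
    static_configs:
      - targets: ['localhost:8080']   # illustrative target
    metric_relabel_configs:
      # Keep only the one metric actually needed; every other series
      # cAdvisor exposes on /metrics is dropped at scrape time.
      - source_labels: [__name__]
        regex: container_network_tcp_usage_total
        action: keep
```

Note this only filters what Prometheus stores: cAdvisor still computes and serves everything on /metrics, which is why disabling on the cAdvisor side matters.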
I added https://github.com/google/cadvisor/pull/1980 just after the version you are using was cut. Try bumping the version to v0.31.0
or higher.
PS: why does the --disable_metrics option have network and tcp,udp at the same time?
I agree the naming is confusing. They are meant to be non-overlapping sets of metrics, as tcp/udp create an enormous number of additional metric streams compared with the basic metrics.
I assumed network was tcp+udp, but the metric container_network_tcp_usage_total is always zero unless network is enabled
That sounds like a bug.
@dashpole thanks. I first tried the latest tag, but it was outdated, so I decided to use v0.30.2
the same way as in https://github.com/google/cadvisor/blob/master/deploy/kubernetes/base/daemonset.yaml
I'm not sure which tag I should test now. Can you advise?
Yeah, I need to update that. I would try the new latest (as of yesterday), v0.32.0
It seems disabling network really does cause the tcp metrics to be zero, even with the v0.32.0 tag. I have 4 running containers on localhost:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5f820c33c9ea google/cadvisor:v0.32.0 "/usr/bin/cadvisor -…" 2 minutes ago Up 2 minutes 0.0.0.0:8080->8080/tcp cadvisor
40e0315e658f debian:wheezy "bash" 4 weeks ago Up 47 seconds puppet-agent-2
cabe397f89ab devopsil/puppet "bash" 4 weeks ago Up About a minute puppet-agent
a37616aab79a devopsil/puppet "bash" 4 weeks ago Up 29 seconds puppet-master
As you can see, I have a Puppet master with 2 registered agents, and they are configured correctly:
$ docker exec puppet-agent puppet agent -t
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Caching catalog for puppet-agent.my.local
Info: Applying configuration version '1547453764'
Notice: Finished catalog run in 0.02 seconds
$ docker exec puppet-agent-2 puppet agent -t
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Caching catalog for puppet-agent-2.my.local
Info: Applying configuration version '1547453764'
Notice: Finished catalog run in 0.01 seconds
[root@puppet-master /]# netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8140 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.11:42191 0.0.0.0:* LISTEN -
If cAdvisor's run command is:
$ docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:v0.32.0 --disable_metrics=udp,network --docker_only
then:
container_network_tcp_usage_total{container_label_build_date="20180402",container_label_license="GPLv2",container_label_name="CentOS Base Image",container_label_vendor="CentOS",id="/docker/a37616aab79a1374046aada7bb69d0c6ed41c63953098bc450dae7166868c5ec",image="devopsil/puppet",name="puppet-master",tcp_state="listen"} 0
If cAdvisor's run command is:
$ docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:v0.32.0 --disable_metrics=udp --docker_only
then it starts showing values:
container_network_tcp_usage_total{container_label_build_date="20180402",container_label_license="GPLv2",container_label_name="CentOS Base Image",container_label_vendor="CentOS",id="/docker/a37616aab79a1374046aada7bb69d0c6ed41c63953098bc450dae7166868c5ec",image="devopsil/puppet",name="puppet-master",tcp_state="listen"} 2
Then, disabling all the metrics except tcp via:
--disable_metrics=network,udp,percpu,sched,process --docker_only
makes the /metrics endpoint return this list of metrics:
container_cpu_load_average_10s
container_cpu_system_seconds_total
container_cpu_usage_seconds_total
container_cpu_user_seconds_total
container_fs_inodes_free
container_fs_inodes_total
container_fs_io_current
container_fs_io_time_seconds_total
container_fs_io_time_weighted_seconds_total
container_fs_limit_bytes
container_fs_read_seconds_total
container_fs_reads_bytes_total
container_fs_reads_merged_total
container_fs_reads_total
container_fs_sector_reads_total
container_fs_sector_writes_total
container_fs_usage_bytes
container_fs_write_seconds_total
container_fs_writes_bytes_total
container_fs_writes_merged_total
container_fs_writes_total
container_last_seen
container_memory_cache
container_memory_failcnt
container_memory_failures_total
container_memory_mapped_file
container_memory_max_usage_bytes
container_memory_rss
container_memory_swap
container_memory_usage_bytes
container_memory_working_set_bytes
container_network_tcp_usage_total
container_scrape_error
container_spec_cpu_period
container_spec_cpu_shares
container_spec_memory_limit_bytes
container_spec_memory_reservation_limit_bytes
container_spec_memory_swap_limit_bytes
container_start_time_seconds
container_tasks_state
Well, with the current version I no longer see always-zero values, and disabled metrics really do disappear from the endpoint, but it still exposes a group of metrics I don't need. The --disable_metrics option provides no mechanism to exclude the metrics above from the endpoint. You would certainly know better than I, but I think this is a shortcoming.
Someone is working on the container_fs metrics: https://github.com/google/cadvisor/pull/2103
No one has ever requested disabling cpu/memory before, but we could add it.
It seems disabling network really does cause the tcp metrics to be zero, even with the v0.32.0 tag.
Yes, this is a bug. I'll look into it sometime...
No one has ever requested disabling cpu/memory before, but we could add it.
I would like this as well. We get plenty of memory and CPU stats from Nomad - I'm looking to use cadvisor only for network stats.
edit: my other problems were covered here
IMHO it would be better to have an option to enable specific metrics, in addition to the option for disabling them. The logic would be: if -enable_metrics has a non-empty set, it overrides the -disable_metrics set.
(If that seems reasonable, I could create a PR implementing it.)
Hi, I have been using cAdvisor as a DaemonSet in a Kubernetes cluster. The cluster consists of 100 nodes, and each node in turn runs ~80 pods. Each node is a high-performance bare-metal server. The DaemonSet is deployed without any limits (I mean k8s limits), and here it is:
As you can see, I have already set up the --disable_metrics argument on the cAdvisor container. The only thing I need from cAdvisor is the Prometheus metric container_network_tcp_usage_total, so everything except tcp has been disabled. (Disabling metrics doesn't exclude them from the /metrics endpoint, nor does it cause them to always be zero, by the way.) Before the main questions I'd like to show this:
The cAdvisor pod with the lowest CPU usage across the cluster consumes less than 2% of a single core:
I've already read the similar issues here and noticed some advice:
1) > Try using --ignore-metrics to disable metrics you are not using. Particularly expensive metrics are disk, diskIO, UDP, and tcp metrics.
2) > I would recommend increasing housekeeping even higher to 10s
But:
- the --ignore_metrics option doesn't exist in my version; --disable_metrics has only 'disk', 'network', 'tcp', 'udp', 'percpu'
- --housekeeping_interval is already 30s
- --docker_only didn't help
So, I completely don't understand:
- what is the best way to completely disable everything I don't need on cAdvisor's side (including the /metrics endpoint)?
PS: why does the --disable_metrics option have network and tcp,udp at the same time? I assumed network was tcp+udp, but the metric container_network_tcp_usage_total is always zero unless network is enabled