ffilippopoulos opened this issue 5 years ago
I wonder if it is the fact that you are now running 2 cAdvisors at the same time? When you are running the daemonset, can you still query the kubelet's cAdvisor (:10255/metrics/cadvisor)?
Yes, that is true; we are still running the kubelet one and can query those metrics as well.
Can you post the output of docker info?
Containers: 109
Running: 79
Paused: 0
Stopped: 30
Images: 143
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: v0.13.2 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
seccomp
Profile: default
selinux
Kernel Version: 4.14.78-coreos
Operating System: Container Linux by CoreOS 1911.3.0 (Rhyolite)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.42GiB
Name: ip-10-66-21-100
ID: A5KU:5YMU:GZXF:UURN:XA7X:JCO4:FS3I:RYQ4:KGE2:FC4A:37F2:P52M
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Hmm, we haven't seen any issues recently with overlay2.
Can you try running with the cadvisor-args patch and see if that helps? I'm wondering if there is an expensive metric we collect by default that the patch disables, or if the housekeeping interval is different...
We have the same issue (using v0.32). cAdvisor sometimes fails to respond within 15 seconds. CPU usage of cAdvisor becomes especially high when the node is under high load: roughly 70% user, 30% sys time.
We already increased requests to 400m CPU, but as can be seen in the graph, cAdvisor would need even more CPU.
I have now applied the patch @dashpole proposed (slightly modified: --disable_metrics=percpu,disk,network,tcp,udp,sched) and will see how it holds up over the next few days.
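In case it is useful to others, here is a rough sketch of how that flag can be wired into the daemonset. This is only an illustration of the approach discussed in this thread, not the actual cadvisor-args patch; the object and container names and the --housekeeping_interval value are assumptions, not taken from the issue.

```yaml
# Sketch of a strategic-merge patch over the upstream cadvisor DaemonSet.
# Names are assumed for illustration; keep whatever the base manifest uses.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
spec:
  template:
    spec:
      containers:
        - name: cadvisor
          args:
            - --housekeeping_interval=10s                            # example value only; relaxes the collection cadence
            - --disable_metrics=percpu,disk,network,tcp,udp,sched    # drop the expensive collectors discussed above
```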
So for the past 5 days we haven't had any peaks.
Directly after rolling out --disable_metrics=percpu,disk,network,tcp,udp,sched (5 days ago), metrics immediately became better (the rollout happened at the missing values in the middle of the picture):
And now it looks like I would expect it to:
Still a bit surprised by how much deviation there is between cadvisor instances, but our nodes are loaded very differently with containers, so this is probably ok. Going forward I wonder what the best approach is. I could try different configurations, but this will take weeks to finish.
The problem with running with --disable_metrics=percpu,disk,network,tcp,udp,sched is that we are missing disk and network metrics, which were previously present in kubelet's cadvisor.
I raised https://github.com/google/cadvisor/pull/2236, which restores parity in the metrics provided, but the performance issue then returns.
I don't quite understand how the kubelet can expose identical metrics without the performance impact.
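To make the trade-off concrete: in flag terms, getting disk and network back just means taking them out of the disable list, as in the sketch below (same daemonset args as earlier; this only illustrates the trade-off described here, not necessarily what the linked PR changes), at which point the CPU spikes return.

```yaml
# Illustrative only: re-enable the disk and network collectors while keeping
# the rest disabled. Per the comment above, this restores metric parity with
# the kubelet's cadvisor but brings the performance problem back.
args:
  - --disable_metrics=percpu,tcp,udp,sched
```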
We are running a kubernetes cluster (v1.11) on aws ec2 instances with disks of gp2 type and a size of 50 gigs. We are trying to deploy cAdvisor as a daemonset to substitute running it as part of the kubelet (as rumour has it that this is going to become deprecated). We use the manifest from https://github.com/google/cadvisor/blob/master/deploy/kubernetes/base/daemonset.yaml to deploy the daemonset, plus a headless service in order to configure a prometheus job for cAdvisor using dns_sd_config.

On this setup, most of the cAdvisor pods start logging tons of messages complaining about du and find, and we see that prometheus times out when trying to scrape them. Actually, prometheus doesn't have anything to do with it, because just curling the metrics endpoint never returns.

So it looks like the pods are stuck doing du and find operations and cannot serve the metrics endpoint in time. Running cAdvisor as part of the kubelet works fine and we are able to scrape its metrics.

/cc @dashpole
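For reference, here is a minimal sketch of the kind of headless service plus dns_sd_config scrape job described above. Every name, the namespace, and the port are illustrative assumptions, not values taken from the actual manifests.

```yaml
# Headless service: a DNS A lookup on the service name returns one record per
# cadvisor pod, which is what dns_sd_config needs to discover each pod.
apiVersion: v1
kind: Service
metadata:
  name: cadvisor          # assumed name
  namespace: monitoring   # assumed namespace
spec:
  clusterIP: None         # headless
  selector:
    app: cadvisor         # assumed pod label
  ports:
    - name: http
      port: 8080          # cAdvisor's default listen port
      targetPort: 8080

# Matching prometheus scrape job (this part lives in prometheus.yml, not in the cluster):
#
# scrape_configs:
#   - job_name: cadvisor
#     dns_sd_configs:
#       - names: ['cadvisor.monitoring.svc.cluster.local']
#         type: A
#         port: 8080
```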