Open sabras75 opened 6 years ago
Does this issue happen for short periods of time, or is it permanent? Can you check the cAdvisor logs to see if it is logging any error messages related to connecting to docker?
This is permanent. I'll check the logs and post it in a future comment.
Here are the logs. I raised the verbosity (--v=42) in the hope of capturing something useful: cadvisor-log.txt
From the log, maybe these messages will be helpful to you:
I0117 13:23:08.720138 1 manager.go:158] Docker not connected: context deadline exceeded
I0117 13:23:18.745673 1 manager.go:270] Registration of the Docker container factory failed: failed to validate Docker info: failed to detect Docker info: context deadline exceeded.
I0117 13:23:21.053932 1 manager.go:970] Added container: "/docker/6cd62bf3133bcc6fd49b1f0c23d4d08c194011eb774afe553535068d3da6b502" (aliases: [], namespace: "")
I0117 13:23:21.054319 1 handler.go:325] Added event &{/docker/6cd62bf3133bcc6fd49b1f0c23d4d08c194011eb774afe553535068d3da6b502 2018-01-17 12:46:31.779005956 +0000 UTC containerCreation {
}}
I0117 13:23:21.054359 1 factory.go:105] Error trying to work out if we can handle /docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11: /docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11 not handled by systemd handler
I0117 13:23:21.054371 1 factory.go:116] Factory "systemd" was unable to handle container "/docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11"
I0117 13:23:21.054384 1 factory.go:112] Using factory "raw" for container "/docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11"
For information, the docker info command is rather slow on this machine (about 6.7 seconds, see below), so maybe this is simply a timeout in cadvisor. What is the value of the timeout? Is there a way to change it from the command line or a config file?
$ time docker info
Containers: 90
 Running: 86
 Paused: 0
 Stopped: 4
Images: 17632
Server Version: 17.12.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 13046
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-108-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 55.03GiB
Name: jenkins-master
ID: IFYG:3LOB:TGH3:NXLV:NYZX:6XS6:BQVL:ZMEJ:AQIC:AVUP:RO7A:MH5Q
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

real 0m6.710s
user 0m0.016s
sys 0m0.004s
I believe cadvisor didn't start up correctly. What version of cAdvisor are you on?
It seems like a bug if we allow docker to timeout during factory registration.
cc @jsravn
The version of cAdvisor is the one from the public docker image "google/cadvisor:v0.28.3"
It looks like you're hitting the timeout, which is hard coded to 5s. I'm surprised your docker info takes so long - I haven't observed it take more than 1s even with hundreds of containers in my own testing.
I suggest we 1. increase the default timeout and 2. make the timeout configurable via a cadvisor flag.
I'm not sure start-up should ignore the timeout: docker calls can hang indefinitely, so that would just cause cadvisor to hang on startup indefinitely.
@dashpole let me know what you think is the best approach. The simplest for now would be to just increase the timeout.
I'm not too happy with the hardcoded value. But I'm not sure of the best way to make it configurable, because passing a timeout around will break the API all over the place. And there isn't a nice way to add config values (e.g. a conf struct on manager.New). I could do it via a global var on the docker package, but it's pretty horrible.
We could also revert the timeout change for now and do it properly via a breaking API change in master. I notice as well none of the rkt and cri-o calls have timeouts.
@jsravn can we add an indefinite retry for startup? The 5s timeout is probably fine...
@dashpole Okay, I made https://github.com/google/cadvisor/pull/1871 for it. I think we should bump the default timeout up as well (better safe than sorry).
On some of our computers (of course, the production ones :( ), cadvisor does not return the full spec of a container: the aliases, namespace and labels are missing.
For instance, when calling the cAdvisor api
it returns :
{
  "/docker/e75fc4a8077cff7c1060f92a32827be5dda25fc793f15566bedfa7a2c82a5bf2": {
    "spec": {
      "creation_time": "2018-01-16T10:18:03.546818282Z",
      "has_cpu": true,
      "cpu": { "limit": 1024, "max_limit": 0, "mask": "0-15" },
      "has_memory": true,
      "memory": { "limit": 9223372036854772000, "reservation": 9223372036854772000 },
      "has_custom_metrics": false,
      "has_network": false,
      "has_filesystem": false,
      "has_diskio": true
    },
    "stats": [ { ...
Calling docker inspect on the same container
returns (see following attachment) : docker-inspect.txt
I was expecting a namespace of 'docker' and at least one alias equal to the name returned by docker inspect ("Name": "/lifyqalmomsgr9ngzj_cadvisor_1"), as this is the behaviour on our development computers.

Interestingly, opening the route "http://localhost:8080/docker/" in the UI returns "failed to get docker info: context deadline exceeded", so we suspect some kind of timeout, but we have not been able to diagnose it precisely or find a solution.
We would greatly appreciate any help in solving this issue.
thanks,
-Sebastien