google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Missing aliases from spec section #1866

Open sabras75 opened 6 years ago

sabras75 commented 6 years ago

On some of our computers (of course, the production ones :( ), cAdvisor does not return a full spec for the container: the aliases, namespace and labels are missing.

For instance, when calling the cAdvisor API

http GET http://10.132.48.8:8080/api/v2.1/stats/docker/e75fc4a8077cff7c1060f92a32827be5dda25fc793f15566bedfa7a2c82a5bf2

it returns:

{
  "/docker/e75fc4a8077cff7c1060f92a32827be5dda25fc793f15566bedfa7a2c82a5bf2": {
    "spec": {
      "creation_time": "2018-01-16T10:18:03.546818282Z",
      "has_cpu": true,
      "cpu": {
        "limit": 1024,
        "max_limit": 0,
        "mask": "0-15"
      },
      "has_memory": true,
      "memory": {
        "limit": 9223372036854772000,
        "reservation": 9223372036854772000
      },
      "has_custom_metrics": false,
      "has_network": false,
      "has_filesystem": false,
      "has_diskio": true
    },
    "stats": [
      {
        ...

Calling docker inspect on the same container

docker inspect e75fc4a8077cff7c1060f92a32827be5dda25fc793f15566bedfa7a2c82a5bf2

returns the output in the following attachment: docker-inspect.txt

I was expecting to have a namespace 'docker' and at least an alias equivalent to the name returned by docker inspect ("Name": "/lifyqalmomsgr9ngzj_cadvisor_1") as this is the behaviour found on our development computers.

Interestingly, in the UI, the route "http://localhost:8080/docker/" returns "failed to get docker info: context deadline exceeded", so we suspect some kind of timeout, but we have not been able to diagnose it precisely or find a solution.

We would greatly appreciate any help in solving this issue.

thanks,

-Sebastien

Docker version: 17.12.0-ce. cAdvisor is running in Docker with the image google/cadvisor:v0.28.3.

dashpole commented 6 years ago

Does this issue happen for short periods of time, or is it permanent? Can you check the cAdvisor logs to see if it is logging any error messages related to connecting to docker?

sabras75 commented 6 years ago

This is permanent. I'll check the logs and post them in a future comment.

sabras75 commented 6 years ago

Here are the logs. I raised the verbosity (--v=42) in the hope of capturing something interesting: cadvisor-log.txt

From the log, these messages may be helpful to you:

I0117 13:23:08.720138 1 manager.go:158] Docker not connected: context deadline exceeded

I0117 13:23:18.745673 1 manager.go:270] Registration of the Docker container factory failed: failed to validate Docker info: failed to detect Docker info: context deadline exceeded.

I0117 13:23:21.053932 1 manager.go:970] Added container: "/docker/6cd62bf3133bcc6fd49b1f0c23d4d08c194011eb774afe553535068d3da6b502" (aliases: [], namespace: "")

I0117 13:23:21.054319 1 handler.go:325] Added event &{/docker/6cd62bf3133bcc6fd49b1f0c23d4d08c194011eb774afe553535068d3da6b502 2018-01-17 12:46:31.779005956 +0000 UTC containerCreation {}}

I0117 13:23:21.054359 1 factory.go:105] Error trying to work out if we can handle /docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11: /docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11 not handled by systemd handler

I0117 13:23:21.054371 1 factory.go:116] Factory "systemd" was unable to handle container "/docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11"

I0117 13:23:21.054384 1 factory.go:112] Using factory "raw" for container "/docker/a1afe3b2d76c7a268331cd1ee6a3814d8bd0e84e9619556d35e7d530b3ee8a11"

sabras75 commented 6 years ago

For information, the docker info command is rather slow on this machine (about 6.7 seconds, see below). So maybe this is simply a timeout in cAdvisor. What is the value of the timeout? Is there a way to change it from the command line or a config file?

$ time docker info
Containers: 90
 Running: 86
 Paused: 0
 Stopped: 4
Images: 17632
Server Version: 17.12.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 13046
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-108-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 55.03GiB
Name: jenkins-master
ID: IFYG:3LOB:TGH3:NXLV:NYZX:6XS6:BQVL:ZMEJ:AQIC:AVUP:RO7A:MH5Q
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

real 0m6.710s
user 0m0.016s
sys 0m0.004s

dashpole commented 6 years ago

I believe cadvisor didn't start up correctly. What version of cAdvisor are you on?

dashpole commented 6 years ago

It seems like a bug if we allow docker to timeout during factory registration.

dashpole commented 6 years ago

cc @jsravn

sabras75 commented 6 years ago

The version of cAdvisor is the one from the public docker image "google/cadvisor:v0.28.3"

jsravn commented 6 years ago

It looks like you're hitting the timeout which is hard coded to 5s. I'm surprised your docker info takes so long - I haven't observed it take more than 1s even with hundreds of containers in my own testing.

I suggest we 1. increase the default timeout and 2. make timeout configurable via cadvisor flag.

I'm not sure if start-up should ignore the timeout - as docker calls can hang indefinitely, so that would just cause cadvisor to hang on startup indefinitely.

jsravn commented 6 years ago

@dashpole let me know what you think is the best approach. The simplest for now would be to just increase the timeout.

I'm not too happy with the hardcoded value. But I'm not sure of the best way to make it configurable, because passing a timeout around would break the API all over the place. And there isn't a nice way to add config values (e.g. a conf struct on manager.New). I could do it via a global var on the docker package, but it's pretty horrible.

We could also revert the timeout change for now and do it properly via a breaking API change in master. I notice as well none of the rkt and cri-o calls have timeouts.

dashpole commented 6 years ago

@jsravn can we add an indefinite retry for startup? The 5s timeout is probably fine...

jsravn commented 6 years ago

@dashpole Okay, I made https://github.com/google/cadvisor/pull/1871 for it. I think we should bump the default timeout up as well (better safe than sorry).