bboreham opened this issue 6 years ago
What version of docker are you using? What docker storage driver are you using? Does it happen consistently, or only occasionally?
Can you verify that the r/w lager id on both good/bad nodes is stored in `/var/lib/docker/image/overlay2/layerdb/mounts/<CONTAINER_UID>/mount-id`?
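For reference, a minimal Go sketch of that lookup (this mirrors the on-disk layout being discussed rather than cAdvisor's actual code; the docker root, storage driver, and container ID below are placeholder assumptions):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// rwLayerID reads the read-write layer ID that Docker records for a container:
// <docker root>/image/<storage driver>/layerdb/mounts/<container id>/mount-id
func rwLayerID(dockerRoot, storageDriver, containerID string) (string, error) {
	p := filepath.Join(dockerRoot, "image", storageDriver, "layerdb", "mounts", containerID, "mount-id")
	b, err := os.ReadFile(p)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	// Placeholder values; substitute your node's "Docker Root Dir" from `docker info`
	// and a real container ID.
	id, err := rwLayerID("/var/lib/docker", "overlay2", "<CONTAINER_ID>")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("r/w layer id:", id)
}
```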
cAdvisor operates independently of kubelet. The kubelet only queries cAdvisor and doesn't provide it any information, so this is purely a cAdvisor issue. The correct behavior here seems to be to continue even if we cannot determine a r/w layer id for the docker container (drop only the "image" field, but keep the others). There is also likely an underlying bug in cAdvisor's monitoring for this docker storage driver, or an incompatibility with a more recent docker version.
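Sketching that suggestion (hypothetical names, not cAdvisor's actual handler code): the error is logged and only the image-related fields are dropped, so the container's other metrics keep flowing.

```go
package docker

import (
	"log"
	"os"
	"path/filepath"
	"strings"
)

// containerSpec is a cut-down stand-in for the data cAdvisor attaches to a container.
type containerSpec struct {
	Image     string // image-related data, dropped if the r/w layer is unknown
	RwLayerID string
	// ... cpu/memory fields would stay populated either way
}

// newContainerSpec degrades gracefully: if the r/w layer ID cannot be read,
// it logs a warning and omits only the image-related fields instead of
// failing the whole container handler.
func newContainerSpec(dockerRoot, storageDriver, containerID, image string) containerSpec {
	spec := containerSpec{Image: image}
	p := filepath.Join(dockerRoot, "image", storageDriver, "layerdb", "mounts", containerID, "mount-id")
	b, err := os.ReadFile(p)
	if err != nil {
		log.Printf("warning: no r/w layer ID for %s: %v; dropping image data only", containerID, err)
		spec.Image = ""
		return spec
	}
	spec.RwLayerID = strings.TrimSpace(string(b))
	return spec
}
```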
I think it happens infrequently, although the only evidence for that is that nobody noticed it happening much. I have set a Prometheus alert on `count by (instance)(container_memory_rss) unless count by (instance)(container_memory_rss{namespace!=""})`, which should let us pick it up faster next time.
I don't have any bad nodes at the moment, since we have variously restarted and blown things away. I will check next time it happens.
The Docker root dir is nonstandard, but I do have a bunch of files at `/mnt/containers/docker/image/overlay2/layerdb/mounts/<CONTAINER_UID>/mount-id`. What am I looking for wrt "r/w lager id"?
```
# docker info
Containers: 116
 Running: 105
 Paused: 0
 Stopped: 11
Images: 2853
Server Version: 1.12.2
Storage Driver: overlay2
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-119-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.44 GiB
Name: ip-172-20-2-209
ID: 4KM4:IOI7:XXAZ:NZG7:XR3K:ETZG:JKWT:S5I2:V7VX:PHUX:CISI:MI2A
Docker Root Dir: /mnt/containers/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8
```
Sorry, typo. Should be r/w layer id, not r/w lager id.
From your log, it looks like cAdvisor isn't respecting the nonstandard docker root dir, since it is failing to open the file `/var/lib/docker/image/overlay2/layerdb/mounts/<CONTAINER_ID>/mount-id`, when it should be looking in `/mnt/containers/docker/image/overlay2/layerdb/mounts/<CONTAINER_ID>/mount-id`. But if that is the case, I am surprised it is only happening occasionally. Can you check your "good" nodes and see if you still see that error message? I am wondering if it is a red herring.
It's happened again, on one node in each of two clusters, both immediately after a reboot.
We do not get the `/var/lib/docker/image` messages on working nodes.
I suspect a race in `RootDir()`, which falls back to the "wrong" directory if it encounters an error, without logging. Immediately after reboot it may not be possible to talk to Docker. I'll try setting the `--docker-root` flag.
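To illustrate the suspected failure mode, here is a sketch of that pattern (approximate, not the actual cAdvisor source; `dockerInfo` below is a stand-in for the real daemon query):

```go
package docker

import (
	"errors"
	"sync"
)

var (
	dockerRootDir     string
	dockerRootDirOnce sync.Once
	dockerRootDirFlag = "/var/lib/docker" // fallback mirroring --docker-root's default
)

// dockerInfo stands in for querying the daemon's /info endpoint; immediately
// after a reboot this is exactly the call that can fail.
func dockerInfo() (dockerRoot string, err error) {
	return "", errors.New("cannot connect to the Docker daemon")
}

// RootDir sketches the suspected bug: the lookup runs exactly once, and if the
// daemon is not reachable yet the error is swallowed, the default is cached,
// and every later path such as <root>/image/overlay2/layerdb/mounts/<id>/mount-id
// is built from the wrong root for the life of the process.
func RootDir() string {
	dockerRootDirOnce.Do(func() {
		root, err := dockerInfo()
		if err == nil && root != "" {
			dockerRootDir = root
		} else {
			dockerRootDir = dockerRootDirFlag // silent fallback, never retried
		}
	})
	return dockerRootDir
}
```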
It happened in our cluster too. The docker graph was changed to /data0/docker. Container/pod names are lost in :10255/metrics/cadvisor, and there are lots of kubelet logs like:
```
Jul 10 08:51:54 j-daily-e2e-control-worker-edge-storage-01 kubelet[2129]: E0710 08:51:54.776667 2129 manager.go:1130] Failed to create existing container: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd66a596523b1edaba6565ad02fbf61d1.slice/docker-1f43e4af14b5bb32a708f5a49546ee69ceccf635db6a494c0b934380e563a11f.scope: failed to identify the read-write layer ID for container "1f43e4af14b5bb32a708f5a49546ee69ceccf635db6a494c0b934380e563a11f". - open /var/lib/docker/image/overlay2/layerdb/mounts/1f43e4af14b5bb32a708f5a49546ee69ceccf635db6a494c0b934380e563a11f/mount-id: no such file or directory
```
Kubernetes 1.10.2, Red Hat docker 1.13.1, Storage Driver: overlay2, Kernel Version: 3.10.0-693.21.1.el7.x86_64, Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo).
This seems to be a race condition between kubelet start and docker start; I can reproduce it by manually starting kubelet and then docker. --docker-root is deprecated, and if the docker daemon is not running, kubelet fails to read from docker info. Will kubelet use the default docker graph to init cadvisor?
@jejer I think it just doesn't work at all if it can't initialize.
Where do you see `--docker-root` deprecated?
BTW after I set `--docker-root` in our system we never saw the problem again.
To fix the underlying problem we need some alternative to `sync.Once` that will run again if the first attempt failed.
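One common shape for that (a sketch, not an existing cAdvisor or standard-library type): a once-like guard that only latches after a successful run, so a failed `docker info` right after boot is retried on the next call.

```go
package docker

import "sync"

// retryingOnce behaves like sync.Once except that a failed attempt does not
// latch: the next caller runs f again, until it succeeds once.
type retryingOnce struct {
	mu   sync.Mutex
	done bool
}

func (o *retryingOnce) Do(f func() error) error {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.done {
		return nil
	}
	if err := f(); err != nil {
		return err // leave done=false so a later call retries
	}
	o.done = true
	return nil
}

// Example use: resolve the Docker root dir lazily, but keep retrying until the
// daemon actually answers instead of caching the compiled-in default forever.
var (
	rootOnce   retryingOnce
	cachedRoot = "/var/lib/docker" // used only while the daemon is unreachable
)

func rootDir(query func() (string, error)) string {
	_ = rootOnce.Do(func() error {
		dir, err := query() // e.g. a docker-info lookup; retried while it errors
		if err != nil {
			return err
		}
		cachedRoot = dir
		return nil
	})
	return cachedRoot
}
```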
@bboreham, kubelet --help prints that --docker-root is deprecated, and it's in the code: https://github.com/google/cadvisor/blob/f7576313bd1df261f90bca93ab91ba019eb823d5/container/docker/factory.go#L62
I will try setting --docker-root, thank you!
"DEPRECATED: docker root is read from docker info (this is a fallback, default: /var/lib/docker)"
OK, that is exactly this bug: if Docker is not ready, the info is not available.
Note it was known at the time (https://github.com/google/cadvisor/pull/1275#discussion_r62554982):
> We cannot deprecate it until the clients of this package can deal with the fact that docker might not be available when cAdvisor is initialized.
added --docker-root for kubelet and never saw this issue again.
Setting `--docker-root` solved the problem for us too. Too bad the flag is deprecated. Is there a plan to fix the underlying issue before that flag is removed?
--docker-root is already set in my case, but I am still facing the issue on some nodes.
Hi @Shivkumar13 did you resolve the issue in the end? We saw this issue if using prometheus
That issue was not resolved properly at that time, but I rebooted the node as a workaround, and suddenly it started showing metrics.
k8s 1.13.3, Docker version 19.03.8, kernel 4.19.12-1.el7. I hit the same error; setting --docker-root solved the problem.
I am facing the same issue on RHEL 8, but using the standalone cadvisor binary with podman version 4.0.2. I cannot use "docker-root" as it is deprecated.
Command: `cadvisor -disable_root_cgroup_stats=false -docker_only=true -housekeeping_interval=30s -store_container_labels=false`
```
Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.810522 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/46fd7e454d3469b4ab4c96c720cd034774c2c6ef142bc468724f24a7373fa078.service WatchSource:0}: failed to identify the read-write layer ID for container "46fd7e454d3469b4ab4c96c720cd034774c2c6ef142bc468724f24a7373fa078". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/46fd7e454d3469b4ab4c96c720cd034774c2c6ef142bc468724f24a7373fa078/mount-id: no such file or directory
Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.813298 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/367400503c1be728d62170e3ae67013dbf32f581ea883aedbd639c9c22c29877.service WatchSource:0}: failed to identify the read-write layer ID for container "367400503c1be728d62170e3ae67013dbf32f581ea883aedbd639c9c22c29877". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/367400503c1be728d62170e3ae67013dbf32f581ea883aedbd639c9c22c29877/mount-id: no such file or directory
Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.822480 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/5e19754c5258145b3e6313e4ac65624a083c2607a0d38f5650c1bd50b7f46e61.service WatchSource:0}: failed to identify the read-write layer ID for container "5e19754c5258145b3e6313e4ac65624a083c2607a0d38f5650c1bd50b7f46e61". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/5e19754c5258145b3e6313e4ac65624a083c2607a0d38f5650c1bd50b7f46e61/mount-id: no such file or directory
Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.826270 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/703dfb2ea8a9efb8cccd401ba89da2adea5c1f8e187bcf73a7b5d6cc92d7b7cb.service WatchSource:0}: failed to identify the read-write layer ID for container "703dfb2ea8a9efb8cccd401ba89da2adea5c1f8e187bcf73a7b5d6cc92d7b7cb". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/703dfb2ea8a9efb8cccd401ba89da2adea5c1f8e187bcf73a7b5d6cc92d7b7cb/mount-id: no such file or directory
```
From https://github.com/kubernetes/kubernetes/issues/39812#issuecomment-381096812
The high-level symptom is that `curl /metrics` returns blank for image, namespace, etc.
Example line from broken node:
Example line from working node:
We have seen this on two clusters in recent weeks, running k8s v1.9.3.
It appears that kubelet's view of the universe has diverged significantly from Docker's, hence it does not have the metadata to tag container metrics. Lots of these in kubelet logs:
Restarting kubelet fixed the issue for the most recent incident. I'm told on another occasion it was necessary to drain, delete all docker files and restart.