Closed: iglov closed this issue 1 year ago.
Hmmm... this might be a bug with --disable_root_cgroup_stats. The message seems to indicate that it is looking for stats for the root ("/") cgroup. You can probably get rid of the error message by not adding that parameter.
I'll triage this and look into it when I can find time. Or, if you are interested, you can try to dig into it and submit a fix. The error comes from nextHousekeepingInterval(), which is called as part of housekeeping().
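For anyone who wants to dig in, here is a minimal toy sketch of the kind of logic involved. It is not the actual cAdvisor source; the cache, stats struct, and signatures below are simplified stand-ins. With dynamic housekeeping, the loop asks the in-memory cache for the container's most recent samples before deciding how long to sleep, and when that lookup fails for "/" (which it always will if root cgroup stats are disabled) it logs this warning and falls back to the current interval.

package main

import (
	"fmt"
	"log"
	"time"
)

// stats is a stand-in for one cached stats sample.
type stats struct {
	Timestamp time.Time
	CPUUsage  uint64
}

// memoryCache stands in for cAdvisor's in-memory stats cache; with root cgroup
// stats disabled, "/" never receives any entries.
var memoryCache = map[string][]stats{}

// recentStats returns up to n of the newest cached samples for a container.
func recentStats(name string, n int) ([]stats, error) {
	s := memoryCache[name]
	if len(s) == 0 {
		return nil, fmt.Errorf("unable to find data in memory cache")
	}
	if len(s) > n {
		s = s[len(s)-n:]
	}
	return s, nil
}

// nextHousekeepingInterval decides how long to wait before the next poll.
func nextHousekeepingInterval(name string, current, max time.Duration) time.Duration {
	recent, err := recentStats(name, 2)
	if err != nil {
		// This is the code path behind the warning reported in this issue.
		log.Printf("Failed to get RecentStats(%q) while determining the next housekeeping: %v", name, err)
		return current
	}
	// If the two newest samples look identical, back off (double, capped at max).
	if len(recent) == 2 && recent[0].CPUUsage == recent[1].CPUUsage {
		if next := 2 * current; next < max {
			return next
		}
		return max
	}
	return current
}

func main() {
	fmt.Println("next housekeeping in", nextHousekeepingInterval("/", 10*time.Second, time.Minute))
}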
Other than the log message, what symptoms are you seeing? Are any metrics you expect missing?
Hey @dashpole!
Nah, I don't want to drop --disable_root_cgroup_stats, because I don't want to get 100500 cgroup metrics :)
I don't know exactly how it affects my monitoring; I was hoping somebody here could tell me. I just see this error in the logs and it worries me :)
P.S. Unfortunately I am not a programmer and have no idea how to fix it :(
@dashpole Is there any progress on this? We have the same issue.
@dongwangdw are there any symptoms other than the log message?
@dashpole We've also been using the flag for 1-2 weeks, and apart from the log message we see no symptoms.
OK, I'm 90% sure we just need to lower the log verbosity when running with --disable_root_cgroup_stats.
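If someone wants to sketch that out: the log prefixes suggest klog, so the change would be roughly along these lines, demoting the "/" failure from a warning to a verbose-level info message. This is only a standalone illustration of the idea, not the actual cAdvisor code path.

package main

import (
	"errors"
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil) // registers -v, -logtostderr, etc. on the default flag set
	flag.Parse()

	// Stand-in for the cache miss on "/" when --disable_root_cgroup_stats is set.
	err := errors.New("unable to find data in memory cache")

	// Current behaviour (roughly): always a warning, even though the miss is expected.
	klog.Warningf("Failed to get RecentStats(%q) while determining the next housekeeping: %v", "/", err)

	// Possible change: only surface it at higher verbosity, so it is silent at the default -v level.
	klog.V(3).Infof("Failed to get RecentStats(%q) while determining the next housekeeping: %v", "/", err)

	klog.Flush()
}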
@dashpole The error messages are below.
E0216 04:10:40.342721 8215 cadvisor_stats_provider.go:440] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/libcontainer_61055_systemd_test_default.slice": RecentStats: unable to find data in memory cache]
hyperkube: E0217 00:04:33.996461 8215 helpers.go:137] readString: Failed to read "/sys/fs/cgroup/memory/libcontainer_108824_systemd_test_default.slice/memory.limit_in_bytes": read /sys/fs/cgroup/memory/libcontainer_108824_systemd_test_default.slice/memory.limit_in_bytes: no such device
@dongwangdw that looks unrelated to this issue. Feel free to open a new issue if you would like.
The following shows up in the logs when I activate the flag:
W0929 12:48:23.346172 1 container.go:448] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W0929 12:48:29.519371 1 prometheus.go:1789] Couldn't get containers: partial failures: ["/": containerDataToContainerInfo: unable to find data in memory cache]
In addition, most / many metrics disappear. Is this related?
"--allow_dynamic_housekeeping=true",
"--global_housekeeping_interval=1m0s",
"--housekeeping_interval=10s",
# Currently buggy. https://github.com/google/cadvisor/issues/2602.
# "--disable_root_cgroup_stats=true",
"--raw_cgroup_prefix_whitelist=/ecs",
"--docker_only=true",
"--store_container_labels=false",
join("", [
"--whitelisted_container_labels='",
"com.amazonaws.ecs.container-name,",
"com.amazonaws.ecs.task-definition-family,",
"promstack.namespace,",
"promstack.alias,",
"promstack.api_type,",
"'"
]),
"--disable_metrics=tcp,advtcp,udp,sched,hugetlb,disk,diskIO,accelerator,resctrl",
@trallnag I'm not sure what you changed but maybe our experience can help. We upgraded from cadvisor 0.36 to 0.37 and all container_ metrics disappeared.
We use cadvisor in Kubernetes with containerd. Removing the --disable_root_cgroup_stats option solved our problem and we got container_ metrics again.
Hey @dashpole @iglov
We ran into the exact same issue when we wanted to upgrade from v0.33.0 to v0.36.0. The docker logs command shows the following warnings, at roughly 1-minute intervals:
W1030 19:58:24.826509 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 19:59:26.022628 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:00:27.176149 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:01:27.980109 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:02:29.232640 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:03:29.366618 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:04:30.211497 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:05:30.825290 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:06:31.976701 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:07:33.350365 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:08:33.915653 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:09:34.812295 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:10:35.357128 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:11:35.930555 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1030 20:12:37.640540 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
Also, the cAdvisor UI is broken. The browser shows just this message:
failed to get container "/" with error: unable to find data in memory cache
However, the exported metrics and the Prometheus /metrics endpoint are OK.
cAdvisor:
sudo docker run --name cadvisor_test -d --restart=always --volume=/:/rootfs:ro --volume=/var/run:/var/run:ro --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --volume=/dev/disk/:/dev/disk:ro --publish=8082:8080 gcr.io/cadvisor/cadvisor:v0.36.0 --docker_only=true --store_container_labels=false --disable_root_cgroup_stats=true --v=0
Notes:
- The same happens with the v0.34.0 version.
- Removing --disable_root_cgroup_stats=true fixes the problem.
- --v=0 adds no other log messages, which is strange.
OS:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
Docker:
$ sudo docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:36 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:06 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
$ sudo docker info
Client:
Debug Mode: false
Server:
Containers: 23
Running: 23
Paused: 0
Stopped: 0
Images: 110
Server Version: 19.03.13
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.15.0-118-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.66GiB
Name: XXX
ID: I3VV:H32G:ZJHD:NOV5:A424:VXV...
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
I ran cAdvisor again with the --v=99 flag (now correctly bumping the verbosity up :)). The whole output is too long to post here, so I chose a snippet around the W1030 warning. There are some Error messages, which might help ...
docker logs output:
...
[{Size:26214400 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None}
I1030 20:54:39.288789 1 manager.go:199] Version: {KernelVersion:4.15.0-118-generic ContainerOsVersion:Alpine Linux v3.10 DockerVersion:19.03.13 DockerAPIVersion:1.40 CadvisorVersion:v0.36.0 CadvisorRevision:4fe450a2}
I1030 20:54:39.291301 1 factory.go:123] Registration of the mesos container factory failed: unable to create mesos agent client: failed to get version
I1030 20:54:39.291331 1 factory.go:54] Registering systemd factory
I1030 20:54:39.294215 1 factory.go:137] Registering containerd factory
I1030 20:54:39.294424 1 factory.go:123] Registration of the crio container factory failed: Get http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info: dial unix /var/run/crio/crio.sock: connect: no such file or directory
I1030 20:54:39.324261 1 factory.go:369] Registering Docker factory
I1030 20:54:39.324763 1 factory.go:101] Registering Raw factory
I1030 20:54:39.325248 1 manager.go:1158] Started watching for new ooms in manager
I1030 20:54:39.330242 1 nvidia.go:53] No NVIDIA devices found.
I1030 20:54:39.330329 1 factory.go:167] Error trying to work out if we can handle /: / not handled by systemd handler
I1030 20:54:39.330345 1 factory.go:178] Factory "systemd" was unable to handle container "/"
I1030 20:54:39.330387 1 factory.go:178] Factory "containerd" was unable to handle container "/"
I1030 20:54:39.330400 1 factory.go:178] Factory "docker" was unable to handle container "/"
I1030 20:54:39.330419 1 factory.go:174] Using factory "raw" for container "/"
I1030 20:54:39.331484 1 manager.go:950] Added container: "/" (aliases: [], namespace: "")
I1030 20:54:39.332237 1 handler.go:325] Added event &{/ 2020-10-30 15:34:44.043969861 +0000 UTC containerCreation {<nil>}}
I1030 20:54:39.332367 1 manager.go:272] Starting recovery of all containers
I1030 20:54:39.332722 1 container.go:467] Start housekeeping for container "/"
W1030 20:54:39.332981 1 container.go:425] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
I1030 20:54:39.362562 1 factory.go:167] Error trying to work out if we can handle /docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333: /docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333 not handled by systemd handler
I1030 20:54:39.362597 1 factory.go:178] Factory "systemd" was unable to handle container "/docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333"
I1030 20:54:39.363544 1 factory.go:167] Error trying to work out if we can handle /docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333: failed to load container: container "b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333" in namespace "k8s.io": not found
I1030 20:54:39.363574 1 factory.go:178] Factory "containerd" was unable to handle container "/docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333"
I1030 20:54:39.366333 1 factory.go:174] Using factory "docker" for container "/docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333"
I1030 20:54:39.368321 1 manager.go:950] Added container: "/docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333" (aliases: [qxxx b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333], namespace: "docker")
I1030 20:54:39.369107 1 handler.go:325] Added event &{/docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333 2020-05-26 13:59:15.572317419 +0000 UTC containerCreation {<nil>}}
I1030 20:54:39.369439 1 factory.go:167] Error trying to work out if we can handle /system.slice/snapd.seeded.service: /system.slice/snapd.seeded.service not handled by systemd handler
I1030 20:54:39.369748 1 factory.go:178] Factory "systemd" was unable to handle container "/system.slice/snapd.seeded.service"
I1030 20:54:39.370035 1 factory.go:178] Factory "containerd" was unable to handle container "/system.slice/snapd.seeded.service"
I1030 20:54:39.370483 1 factory.go:178] Factory "docker" was unable to handle container "/system.slice/snapd.seeded.service"
I1030 20:54:39.369507 1 container.go:467] Start housekeeping for container "/docker/b05469a97e2c679d51acc3966437695144f642497dab1cef910a0064f1c39333"
I1030 20:54:39.370731 1 factory.go:171] Factory "raw" can handle container "/system.slice/snapd.seeded.service", but ignoring.
I1030 20:54:39.371142 1 manager.go:908] ignoring container "/system.slice/snapd.seeded.service"
I1030 20:54:39.371178 1 factory.go:167] Error trying to work out if we can handle /system.slice/cloud-init.service: /system.slice/cloud-init.service not handled by systemd handler
I1030 20:54:39.371198 1 factory.go:178] Factory "systemd" was unable to handle container "/system.slice/cloud-init.service"
I1030 20:54:39.371218 1 factory.go:178] Factory "containerd" was unable to handle container "/system.slice/cloud-init.service"
I1030 20:54:39.371242 1 factory.go:178] Factory "docker" was unable to handle container "/system.slice/cloud-init.service"
I1030 20:54:39.371263 1 factory.go:171] Factory "raw" can handle container "/system.slice/cloud-init.service", but ignoring.
I1030 20:54:39.371290 1 manager.go:908] ignoring container "/system.slice/cloud-init.service"
I1030 20:54:39.371317 1 factory.go:167] Error trying to work out if we can handle /docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947: /docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947 not handled by systemd handler
I1030 20:54:39.371340 1 factory.go:178] Factory "systemd" was unable to handle container "/docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947"
I1030 20:54:39.372072 1 factory.go:167] Error trying to work out if we can handle /docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947: failed to load container: container "317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947" in namespace "k8s.io": not found
I1030 20:54:39.372098 1 factory.go:178] Factory "containerd" was unable to handle container "/docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947"
I1030 20:54:39.374996 1 factory.go:174] Using factory "docker" for container "/docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947"
I1030 20:54:39.377106 1 manager.go:950] Added container: "/docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947" (aliases: [gxxx 317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947], namespace: "docker")
I1030 20:54:39.377962 1 handler.go:325] Added event &{/docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947 2020-05-06 07:43:01.701829768 +0000 UTC containerCreation {<nil>}}
I1030 20:54:39.378270 1 factory.go:167] Error trying to work out if we can handle /system.slice/systemd-networkd.service: /system.slice/systemd-networkd.service not handled by systemd handler
I1030 20:54:39.378402 1 factory.go:178] Factory "systemd" was unable to handle container "/system.slice/systemd-networkd.service"
I1030 20:54:39.378536 1 factory.go:178] Factory "containerd" was unable to handle container "/system.slice/systemd-networkd.service"
I1030 20:54:39.378674 1 factory.go:178] Factory "docker" was unable to handle container "/system.slice/systemd-networkd.service"
I1030 20:54:39.378807 1 factory.go:171] Factory "raw" can handle container "/system.slice/systemd-networkd.service", but ignoring.
I1030 20:54:39.378950 1 manager.go:908] ignoring container "/system.slice/systemd-networkd.service"
I1030 20:54:39.379078 1 factory.go:171] Factory "systemd" can handle container "/system.slice/sys-fs-fuse-connections.mount", but ignoring.
I1030 20:54:39.379315 1 manager.go:908] ignoring container "/system.slice/sys-fs-fuse-connections.mount"
I1030 20:54:39.379474 1 factory.go:171] Factory "systemd" can handle container "/system.slice/dev-mqueue.mount", but ignoring.
I1030 20:54:39.378358 1 container.go:467] Start housekeeping for container "/docker/317ac8a9aeba3a7df7c2933becf93da3801765c958df1b80e4b4b8ecebb5d947"
I1030 20:54:39.379678 1 manager.go:908] ignoring container "/system.slice/dev-mqueue.mount"
I1030 20:54:39.380064 1 factory.go:167] Error trying to work out if we can handle /system.slice/grub-common.service: /system.slice/grub-common.service not handled by systemd handler
I1030 20:54:39.380094 1 factory.go:178] Factory "systemd" was unable to handle container "/system.slice/grub-common.service"
I1030 20:54:39.380116 1 factory.go:178] Factory "containerd" was unable to handle container "/system.slice/grub-common.service"
I1030 20:54:39.380136 1 factory.go:178] Factory "docker" was unable to handle container "/system.slice/grub-common.service"
I1030 20:54:39.380159 1 factory.go:171] Factory "raw" can handle container "/system.slice/grub-common.service", but ignoring.
I1030 20:54:39.380188 1 manager.go:908] ignoring container "/system.slice/grub-common.service"
I1030 20:54:39.380212 1 factory.go:167] Error trying to work out if we can handle /system.slice/snapd.socket: /system.slice/snapd.socket not handled by systemd handler
I1030 20:54:39.380232 1 factory.go:178] Factory "systemd" was unable to handle container "/system.slice/snapd.socket"
I1030 20:54:39.380250 1 factory.go:178] Factory "containerd" was unable to handle container "/system.slice/snapd.socket"
I1030 20:54:39.380266 1 factory.go:178] Factory "docker" was unable to handle container "/system.slice/snapd.socket"
I1030 20:54:39.380286 1 factory.go:171] Factory "raw" can handle container "/system.slice/snapd.socket", but ignoring.
I1030 20:54:39.380311 1 manager.go:908] ignoring container "/system.slice/snapd.socket"
I1030 20:54:39.380345 1 factory.go:167] Error trying to work out if we can handle /docker/71cc527a91e9c97649c8ad906b7af1b6bc9e1ed5f668c2855396f2e66fe71313: /docker/71cc527a91e9c97649c8ad906b7af1b6bc9e1ed5f668c2855396f2e66fe71313 not handled by systemd handler
I1030 20:54:39.380375 1 factory.go:178] Factory "systemd" was unable to handle container "/docker/71cc527a91e9c97649c8ad906b7af1b6bc9e1ed5f668c2855396f2e66fe71313"
I1030 20:54:39.380974 1 factory.go:167] Error trying to work out if we can handle /docker/71cc527a91e9c97649c8ad906b7af1b6bc9e1ed5f668c2855396f2e66fe7131
...
When I set --disable_root_cgroup_stats=true, I get: container.go:448] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
Docker version 20.10.2, build 2291f61
cAdvisor version v0.37.0 (65fa5b44)
Ubuntu 20.04.1 LTS
Having --disable_root_cgroup_stats=true results in no container_* metrics and similar errors in the logs, as described above. We found this out while upgrading from 16.04 LTS, which also included an upgrade of cAdvisor from 0.36 to 0.37 and Docker from 19.03 to 20.10.
I've tried downgrading cAdvisor without luck.
I tried restarting docker and kubelet, but it didn't work. After I rebooted the node, cAdvisor went back to normal.
I have cAdvisor running as a systemd unit on a CentOS 7 server, and in the logs I see an error every minute:
cadvisor[103112]: W1120 14:47:56.678182 103112 container.go:422] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
What is this and how do I fix it?
OS: CentOS Linux release 7.7.1908 (Core)
cAdvisor version: v0.34.0 (24a6a52f)
Run with flags:
--docker=unix:///var/run/docker.sock --listen_ip=192.168.49.177 --port=4194 --disable_root_cgroup_stats=true --docker_only=true --logtostderr=true
validate: