google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

docker_only not working correctly #3037

Open flixr opened 2 years ago

flixr commented 2 years ago

I'm trying to run cAdvisor as a binary on the host (launched via systemd, not as a Docker container) to collect metrics for my Docker containers only.

To get CPU usage down, I looked at #2523. Increasing the housekeeping interval helps, but cAdvisor still seems to track quite a few things in the root namespace, which might account for the higher CPU usage.

cAdvisor version: v0.41.0 (v0.40.0.53+9fae30700d53a3)

Running with flags:

--housekeeping_interval=10s --max_housekeeping_interval=15s --event_storage_event_limit=default=0 --event_storage_age_limit=default=0 --docker_only=true --enable_metrics=cpu,cpuLoad,diskIO,memory,network,oom_event,process --store_container_labels=false

I can still see all processes running on my host on the cAdvisor /containers/ endpoint under Processes. These should not be visible!

If I add --disable_root_cgroup_stats=true, I get errors (similar to #2341):

W1229 14:55:53.747106 2683142 container.go:489] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache
W1229 14:55:55.379517 2683142 manager.go:705] Error getting data for container / because of race condition
W1229 14:55:56.770975 2683142 manager.go:705] Error getting data for container / because of race condition

The endpoint /containers/ just shows: failed to get container "/" with error: unable to find data in memory cache. This is a bit annoying since it is the default page, but not really a problem.
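To gauge how often each of these warnings fires, the log lines can be tallied by source location. The following is a minimal sketch, assuming the standard klog header layout (severity letter, MMDD, time, PID, file:line]). The names KLOG_RE and tally_warnings are illustrative and not part of cAdvisor.

```python
import re
from collections import Counter

# klog header: severity letter, MMDD, time, PID, "file:line]", then the message.
# search() is used instead of match() so a journald prefix before the header is tolerated.
KLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)"
)


def tally_warnings(lines):
    """Count warning-level log lines per source location (file:line)."""
    counts = Counter()
    for line in lines:
        m = KLOG_RE.search(line)
        if m and m.group("sev") == "W":
            counts[m.group("src")] += 1
    return counts


# The three warning lines quoted above.
sample = [
    'W1229 14:55:53.747106 2683142 container.go:489] Failed to get RecentStats("/") '
    "while determining the next housekeeping: unable to find data in memory cache",
    "W1229 14:55:55.379517 2683142 manager.go:705] Error getting data for container / because of race condition",
    "W1229 14:55:56.770975 2683142 manager.go:705] Error getting data for container / because of race condition",
]

if __name__ == "__main__":
    for src, n in tally_warnings(sample).most_common():
        print(src, n)
```

Feeding it the output of the service's journal instead of the sample list should work the same way, since the pattern ignores anything before the klog header.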

Why is cAdvisor always adding a "root container" / which does not exist (as a container)?

Validate output:

cAdvisor version: v0.40.0.53+9fae30700d53a3

OS version: Ubuntu 20.04.3 LTS

Kernel version: [Supported and recommended]
    Kernel version is 5.4.0-91-generic. Versions >= 2.6 are supported. 3.0+ are recommended.

Cgroup setup: [Supported and recommended]
    Available cgroups: map[blkio:1 cpu:1 cpuacct:1 cpuset:1 devices:1 freezer:1 hugetlb:1 memory:1 net_cls:1 net_prio:1 perf_event:1 pids:1 rdma:1]
    Following cgroups are required: [cpu cpuacct]
    Following other cgroups are recommended: [memory blkio cpuset devices freezer]
    Hierarchical memory accounting enabled. Reported memory usage includes memory used by child containers.
    Cpu cfs bandwidth is enabled.

Cgroup mount setup: [Supported and recommended]
    Cgroups are mounted at /sys/fs/cgroup.
    Cgroup mount directories: blkio cpu cpu,cpuacct cpuacct cpuset devices freezer hugetlb memory net_cls net_cls,net_prio net_prio perf_event pids rdma systemd unified 
    Any cgroup mount point that is detectible and accessible is supported. /sys/fs/cgroup is recommended as a standard location.
    Cgroup mounts:
    cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
    cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
    cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
    cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
    cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
    cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
    cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
    cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
    cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
    cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
    cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

Docker version: [Supported and recommended]
    Docker version is 20.10.12. Versions >= 1.0 are supported. 1.2+ are recommended.

Docker driver setup: [Supported and recommended]
    Storage driver is overlay2.

Block device setup: [Supported, but not recommended]
    None of the devices support 'cfq' I/O scheduler. No disk stats can be reported.
     Disk "nvme0n1" Scheduler type "none".
     Disk "sda" Scheduler type "mq-deadline".

Inotify watches: 

Managed containers: 
    /docker_limit.slice/docker-e994b91b167a82d0bf5ea7eac4aadf4af50cf0cddc0ebf1be8569799bb3a1128.scope
        Namespace: docker
        Aliases:
            grafana
            e994b91b167a82d0bf5ea7eac4aadf4af50cf0cddc0ebf1be8569799bb3a1128
    /docker_limit.slice/docker-87d5a871ac280684167ef01ef6636a21979e0e51af664bb3d1df4af07561d479.scope
        Namespace: docker
        Aliases:
            prometheus
            87d5a871ac280684167ef01ef6636a21979e0e51af664bb3d1df4af07561d479
    /docker_limit.slice/docker-9ac87d2544d6087fe44797e59190d8325eafa9f0c38b2eeae26299283c42814a.scope
        Namespace: docker
        Aliases:
            hello_rc_cube
            9ac87d2544d6087fe44797e59190d8325eafa9f0c38b2eeae26299283c42814a
    /

Creatone commented 2 years ago

Do you still have issues?

flixr commented 2 years ago

Yes. I haven't tried a newer version though... Do you expect this is fixed in a later release? Then I can try to check again... The cadvisor 0.43 binary required a newer glibc than I have on these systems, so I used 0.41. I guess I would need to try to build the latest version from source to get it running.

hhromic commented 1 year ago

This is still happening as of cAdvisor 0.47.0. Very annoying error :(

flixr commented 1 year ago

Yes, also still having this problem with 0.47.0

iwankgb commented 1 year ago

I'm trying to understand the problem. So far I have tested the following scenarios:

Podman container was started with non-root user.

@flixr, can you rephrase your problem, please? I'm struggling to understand it.

flixr commented 1 year ago

I want cadvisor to collect only metrics for docker containers and use as little CPU as possible. So I run it with

cadvisor --port=9338 \
    --housekeeping_interval=10s \
    --max_housekeeping_interval=15s \
    --event_storage_event_limit=default=0 \
    --event_storage_age_limit=default=0 \
    --docker_only=true \
    --raw_cgroup_prefix_whitelist=/docker_limit.slice/ \
    --disable_root_cgroup_stats=true \
    --store_container_labels=false \
    --enable_metrics=cpu,cpuLoad,diskIO,memory,network,oom_event,process

Docker is running with the "cgroup-parent": "docker_limit.slice" option, and I limit CPU and memory for all containers on that slice.

With --disable_root_cgroup_stats=true as above I get:

Sep 13 12:20:21 rc-visard-ng-1421823001014 cadvisor[1407]: W0913 12:20:21.872992    1407 manager.go:694] Error getting data for container / because of race condition
Sep 13 12:20:22 rc-visard-ng-1421823001014 cadvisor[1407]: W0913 12:20:22.832694    1407 container.go:485] Failed to get RecentStats("/") while determining the next housekeeping: unable to find data in memory cache

If I don't disable root cgroup stats, it collects stats for the root cgroup, i.e. for all processes on the host. I don't want that, as it results in higher CPU usage.
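One way to check which cgroups cAdvisor actually exports is to list the distinct id label values on its Prometheus /metrics endpoint. This is a rough sketch, assuming the container_* metrics carry the cgroup path in an id label. The helper names and the sample metric lines (including the shortened "docker-abc.scope" path) are made up for illustration.

```python
import re
import urllib.request

# Matches an id="..." label in a Prometheus text-format sample line.
# \b keeps labels such as some_id="..." from matching.
ID_LABEL_RE = re.compile(r'\bid="((?:[^"\\]|\\.)*)"')


def exported_container_ids(metrics_text):
    """Return the distinct `id` label values found on container_* metrics."""
    ids = set()
    for line in metrics_text.splitlines():
        if not line.startswith("container_"):
            continue  # skip "# HELP"/"# TYPE" comments and non-container metrics
        m = ID_LABEL_RE.search(line)
        if m:
            ids.add(m.group(1))
    return ids


def fetch_metrics(url="http://localhost:9338/metrics"):
    """Fetch the metrics page; port 9338 is taken from the --port flag above."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


# Made-up sample lines, not real output from the affected host.
sample = """# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{cpu="total",id="/",image="",name=""} 123.4
container_memory_usage_bytes{id="/docker_limit.slice/docker-abc.scope",image="grafana",name="grafana"} 1048576
"""

if __name__ == "__main__":
    for cgroup_id in sorted(exported_container_ids(sample)):
        print(cgroup_id)
```

If "/" shows up in the result for a live instance (via exported_container_ids(fetch_metrics())), the root cgroup is still being exported despite --docker_only=true.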

demkkka commented 2 weeks ago

Hi there! It looks like the issue has been fixed, at least in v0.49.1.