google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

How can I get the container reserved CPUs? #2536

Open Qayo opened 4 years ago

Qayo commented 4 years ago

Hello, we are using cadvisor to monitor memory and cpu usage of containers. It was working pretty well with docker version 1.13.0 and CentOS 7.4. However, we upgraded recently to docker 19.03.6 and CentOS 7.7 and we cannot get the reserved CPU per container (we run the containers through Mesos-Marathon).

Before, we could use this metric in Grafana: container_spec_cpu_shares{container_label_MESOS_TASK_ID!=""}/1024 and this would give us the CPUs reserved per container in Mesos, which was quite useful for us. Right now, it is not returning any value.
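For reference, the arithmetic behind that query can be sketched as below. This is a minimal illustration, assuming the 1024-shares-per-CPU convention implied by the `/1024` in the query; the function name `shares_to_cpus` is hypothetical and not part of cAdvisor or Mesos.

```python
# Assumption: one full CPU corresponds to 1024 cpu.shares, as implied by
# the "/1024" in the Grafana query above.
DEFAULT_SHARES_PER_CPU = 1024


def shares_to_cpus(cpu_shares: int, shares_per_cpu: int = DEFAULT_SHARES_PER_CPU) -> float:
    """Convert a cpu.shares value into the number of reserved CPUs it represents."""
    if shares_per_cpu <= 0:
        raise ValueError("shares_per_cpu must be positive")
    return cpu_shares / shares_per_cpu


if __name__ == "__main__":
    # A container started with 512 shares corresponds to half a CPU.
    print(shares_to_cpus(512))
```

Note that cpu.shares is a relative weight, so the result is only meaningful as a "reservation" when every container on the node follows the same convention.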

Is this a compatibility issue with Docker, fixed with a newer cadvisor version? Perhaps we have to tweak things in Docker?

Thanks in advance!

dashpole commented 4 years ago

There are a few things we probably want to check:

Can you share the cAdvisor log?
Can you share the docker inspect output for the container? Just to verify that Mesos-Marathon is setting CPU shares.
Can you check the cgroup setup on the node? The cAdvisor metric should include the container id, which is the path under /sys/fs/cgroup/cpu/. Then check the cpu.shares file in that cgroup to see what it contains.
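The cgroup check can also be scripted. The sketch below is only an illustration: it assumes a cgroup v1 layout in which Docker containers live under a `docker` parent directory inside /sys/fs/cgroup/cpu/ (this depends on the cgroup driver and on any configured cgroup parent), and `read_cpu_shares` is a hypothetical helper name.

```python
from pathlib import Path


def read_cpu_shares(
    container_id: str,
    cgroup_root: str = "/sys/fs/cgroup/cpu",
    parent: str = "docker",  # assumption: default parent for the cgroupfs driver
) -> int:
    """Read cpu.shares for one container from the cgroup v1 cpu hierarchy."""
    shares_file = Path(cgroup_root) / parent / container_id / "cpu.shares"
    # The file holds a single integer followed by a newline.
    return int(shares_file.read_text().strip())
```

Calling `read_cpu_shares("<full container id>")` on the node should return the same number that `cat cpu.shares` prints in that directory; a `FileNotFoundError` suggests the container sits under a different cgroup parent.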

Qayo commented 4 years ago

Hello,

Here is the docker inspect output:

# docker inspect ca1dd6fe1121
[
    {
        "Id": "ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53",
        "Created": "2020-05-06T14:48:49.748425978Z",
        "Path": "/usr/bin/cadvisor",
        "Args": [
            "-logtostderr"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 28173,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2020-05-06T14:48:50.196663407Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Health": {
                "Status": "healthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2020-05-10T23:57:02.794114363-07:00",
                        "End": "2020-05-10T23:57:02.88937386-07:00",
                        "ExitCode": 0,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100     2  100     2    0     0   2000      0 --:--:-- --:--:-- --:--:--  2000\nok"
                    },
                    {
                        "Start": "2020-05-10T23:57:32.891822884-07:00",
                        "End": "2020-05-10T23:57:32.993163137-07:00",
                        "ExitCode": 0,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100     2  100     2    0     0   2000      0 --:--:-- --:--:-- --:--:--  2000\nok"
                    },
                    {
                        "Start": "2020-05-10T23:58:02.995539059-07:00",
                        "End": "2020-05-10T23:58:03.095436853-07:00",
                        "ExitCode": 0,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100     2  100     2    0     0   2000      0 --:--:-- --:--:-- --:--:--  2000\nok"
                    },
                    {
                        "Start": "2020-05-10T23:58:33.097826048-07:00",
                        "End": "2020-05-10T23:58:33.197799232-07:00",
                        "ExitCode": 0,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100     2  100     2    0     0   2000      0 --:--:-- --:--:-- --:--:--  2000\nok"
                    },
                    {
                        "Start": "2020-05-10T23:59:03.199713141-07:00",
                        "End": "2020-05-10T23:59:03.290209724-07:00",
                        "ExitCode": 0,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100     2  100     2    0     0   2000      0 --:--:-- --:--:-- --:--:--  2000\nok"
                    }
                ]
            }
        },
        "Image": "sha256:752d61707eac173cfe56a23aa9de051597444286163667d60f8e6d4c63306472",
        "ResolvConfPath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/hostname",
        "HostsPath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/hosts",
        "LogPath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53-json.log",
        "Name": "/mesos-f8969e4d-ec8f-49ca-99d7-460ed0b10045",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/var/run:/var/run:rw",
                "/var/lib/docker:/var/lib/docker:ro",
                "/dev/disk:/dev/disk:ro",
                "/sys/fs/cgroup/cpu,cpuacct:/sys/fs/cgroup/cpuacct,cpu:ro",
                "/sys/fs/cgroup/memory:/sys/fs/cgroup/memory:ro",
                "/var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045:/mnt/mesos/sandbox",
                "/:/rootfs:ro"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "bridge",
            "PortBindings": {
                "8080/tcp": [
                    {
                        "HostIp": "",
                        "HostPort": "7070"
                    }
                ]
            },
            "RestartPolicy": {
                "Name": "no",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "Capabilities": null,
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": true,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [
                "label=disable"
            ],
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 512,
            "Memory": 536870912,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": [],
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": 1073741824,
            "MemorySwappiness": null,
            "OomKillDisable": false,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": null,
            "ReadonlyPaths": null
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e-init/diff:/var/lib/docker/overlay2/6ea1a261d27ebae3b05b57d7aab941e161ce7f4b670bdcc0b1f105833de66bf3/diff:/var/lib/docker/overlay2/38bf623385ba11ebbd19abb3f8f48dba437be24a19769fdae3ff035e0051969b/diff:/var/lib/docker/overlay2/5536556e53ecdfbfd88e1e7ed98db4eb5cae4dc511bf4b60144de94acea39dd7/diff",
                "MergedDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e/merged",
                "UpperDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e/diff",
                "WorkDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/",
                "Destination": "/rootfs",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rslave"
            },
            {
                "Type": "bind",
                "Source": "/var/run",
                "Destination": "/var/run",
                "Mode": "rw",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/lib/docker",
                "Destination": "/var/lib/docker",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rslave"
            },
            {
                "Type": "bind",
                "Source": "/dev/disk",
                "Destination": "/dev/disk",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/sys/fs/cgroup/cpu,cpuacct",
                "Destination": "/sys/fs/cgroup/cpuacct,cpu",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/sys/fs/cgroup/memory",
                "Destination": "/sys/fs/cgroup/memory",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045",
                "Destination": "/mnt/mesos/sandbox",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        "Config": {
            "Hostname": "ca1dd6fe1121",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": true,
            "AttachStderr": true,
            "ExposedPorts": {
                "8080/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "MARATHON_APP_DOCKER_IMAGE=docker-registry.marathon.rsshpc1prd.sc1.roche.com:5000/cadvisor:v0.33.0",
                "MARATHON_APP_ID=/cadvisor",
                "MARATHON_APP_RESOURCE_DISK=0.0",
                "MARATHON_APP_RESOURCE_GPUS=0",
                "MARATHON_APP_LABELS=",
                "MESOS_CONTAINER_NAME=mesos-f8969e4d-ec8f-49ca-99d7-460ed0b10045",
                "MESOS_SANDBOX=/mnt/mesos/sandbox",
                "MESOS_TASK_ID=cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3",
                "PORT=7070",
                "PORT0=7070",
                "PORTS=7070",
                "HOST=lb023mesos.eth.rsshpc1.sc1.science.roche.com",
                "MARATHON_APP_RESOURCE_CPUS=0.5",
                "MARATHON_APP_RESOURCE_MEM=512.0",
                "MARATHON_APP_VERSION=2020-05-06T14:48:46.353Z",
                "PORT_8080=7070",
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "GLIBC_VERSION=2.28-r0"
            ],
            "Cmd": null,
            "Healthcheck": {
                "Test": [
                    "CMD-SHELL",
                    "curl -f http://localhost:8080/healthz || exit 1"
                ],
                "Interval": 30000000000,
                "Timeout": 3000000000
            },
            "Image": "docker-registry.marathon.rsshpc1prd.sc1.roche.com:5000/cadvisor:v0.33.0",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": [
                "/usr/bin/cadvisor",
                "-logtostderr"
            ],
            "OnBuild": null,
            "Labels": {
                "MESOS_TASK_ID": "cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "fa0b73f43783d3a0f96aa8737eb0904181fd26dfd905e751870465da3401d243",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {
                "8080/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "7070"
                    }
                ]
            },
            "SandboxKey": "/var/run/docker/netns/fa0b73f43783",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "f0c32cc7f735ed14cc3f877debe90dbb8f098448926cf9b2b070bb1a280b1ce2",
            "Gateway": "172.17.0.1",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "172.17.0.2",
            "IPPrefixLen": 16,
            "IPv6Gateway": "",
            "MacAddress": "02:42:ac:11:00:02",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "a833454843d54d3751190ebcbaa78e8dab4ad731756390b5fb781bc1d8430ae2",
                    "EndpointID": "f0c32cc7f735ed14cc3f877debe90dbb8f098448926cf9b2b070bb1a280b1ce2",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:ac:11:00:02",
                    "DriverOpts": null
                }
            }
        }
    }
]
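To check the fields that matter here without reading the whole dump, the output can be parsed. This is a rough sketch: the keys `HostConfig.CpuShares` and `MARATHON_APP_RESOURCE_CPUS` are taken from the output above, while the function name `cpu_settings_from_inspect` is hypothetical.

```python
import json
from typing import Any, Dict


def cpu_settings_from_inspect(inspect_json: str) -> Dict[str, Any]:
    """Extract the CPU shares and the Marathon-requested CPUs from `docker inspect` JSON."""
    containers = json.loads(inspect_json)  # docker inspect prints a JSON array
    container = containers[0]

    host_config = container.get("HostConfig") or {}
    env_list = (container.get("Config") or {}).get("Env") or []

    # Env entries look like "KEY=value"; split only on the first "=".
    env = dict(item.split("=", 1) for item in env_list if "=" in item)
    requested = env.get("MARATHON_APP_RESOURCE_CPUS")

    return {
        "cpu_shares": host_config.get("CpuShares"),
        "requested_cpus": float(requested) if requested is not None else None,
    }
```

For the container above this would report 512 shares against a request of 0.5 CPUs, which is consistent with 1024 shares per CPU.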

The cpu.shares file exists and it contains the correct number:

# pwd
/sys/fs/cgroup/cpu/docker/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53
# cat cpu.shares
512

Regarding the logs, this is the content of /var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045/stderr

I0506 07:48:49.661885 28050 exec.cpp:162] Version: 1.7.0
I0506 07:48:49.666883 28070 exec.cpp:236] Executor registered on agent 4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7
I0506 07:48:49.668123 28074 executor.cpp:130] Registered docker executor on lb023mesos.eth.rsshpc1.sc1.science.roche.com
I0506 07:48:49.668407 28075 executor.cpp:186] Starting task cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3

Meanwhile, /var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045/stdout is empty.

Also, I forgot to say that we are using cAdvisor version 0.33, downloaded from Docker Hub a year ago. I've seen that the original repository is deprecated, but I can't find the new Docker images on Docker Hub.

Thanks!

dashpole commented 4 years ago

New images are only hosted in gcr.io. See the readme.

If you scrape the cAdvisor endpoint, do you see any metrics with the container id listed above? I'm trying to figure out if you are just missing some metadata from the metric, or if the metric isn't present at all
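One way to answer that is to filter the scraped text by container id and list which metric names remain. The sketch below assumes the endpoint returns the Prometheus text exposition format and that the id appears somewhere in each sample's labels; `metric_names_for_container` is a hypothetical helper, not a cAdvisor API.

```python
from typing import Set


def metric_names_for_container(exposition_text: str, container_id: str) -> Set[str]:
    """Return the names of all metrics whose sample lines mention container_id."""
    names: Set[str] = set()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if container_id not in line:
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names
```

If `container_spec_cpu_shares` is absent from the result while other `container_*` metrics are present, the metric itself is missing rather than just its labels.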

Qayo commented 4 years ago

This is what it looks like. There is no CPU info.

[image: new]

I compared with another cluster, which hasn't been upgraded, and there the CPU shares do appear:

[image: old]

dashpole commented 4 years ago

It's strange that you don't have any logs for cAdvisor...

Do you know what version you were using before the upgrade? Can you try using the latest version (gcr.io/google_containers/cadvisor:v0.36.0), and see if the problem still happens?

Qayo commented 4 years ago

Hello,

Sorry, cadvisor version was always 0.33. What we upgraded was Docker from 1.13.0 to 19.03.6; sorry if that was not clear. In any case, I will try to upgrade to 0.36 and come back to you. Thanks!

Qayo commented 4 years ago

Hello,

I upgraded to version 0.36, but unfortunately, the issue persists.

dashpole commented 4 years ago

I haven't seen this bug before. Here is where the shares should be collected: https://github.com/google/cadvisor/blob/6a8d61401ea994338e41b013fb353ded17f87269/container/common/helpers.go#L100

If you can get logs from cAdvisor, that would help diagnose the issue. It may require increasing the logging verbosity to --v=4 to get more details.

Qayo commented 4 years ago

Here is the log output after increasing the verbosity:

stderr_cadvisor.txt

dashpole commented 4 years ago

Sadly, I can't see anything interesting in the logs. Just to confirm, you are missing all CPU share metrics, right?

Qayo commented 4 years ago

Yes, that is the problem.

a-nldisr commented 4 years ago

Our Docker version, 18.09.6, still works on CentOS 7.6.1810. Here the issue occurs only with UCR containers; I am going to check whether there is already a report for that, since it looks like a different issue.