Open Qayo opened 4 years ago
There are a few things we probably want to check...
Can you share the cAdvisor log?
Can you share the docker inspect for the container? Just to verify that mesos-marathon is setting CPU shares
Can you check the cgroup setup on the node? If you look at the cAdvisor metric, it should include the container ID, which is the path in /sys/fs/cgroup/cpu/. Then check the cpu.shares file in that cgroup to see what it contains.
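The cgroup check above can also be scripted. This is only a minimal sketch, assuming a cgroup v1 host where the CPU controller is mounted at /sys/fs/cgroup/cpu; the function name `read_cpu_shares` and its parameters are hypothetical, not part of cAdvisor or Docker.

```python
from pathlib import Path
from typing import Optional

# Assumed mount point of the cgroup v1 CPU controller on the node.
CGROUP_CPU_ROOT = Path("/sys/fs/cgroup/cpu")


def read_cpu_shares(cgroup_path: str, root: Path = CGROUP_CPU_ROOT) -> Optional[int]:
    """Return the value in <root>/<cgroup_path>/cpu.shares, or None if it is missing.

    cgroup_path is the container's cgroup path relative to the controller root.
    """
    # Strip leading slashes so pathlib does not treat the argument as absolute.
    shares_file = root / cgroup_path.lstrip("/") / "cpu.shares"
    try:
        return int(shares_file.read_text().strip())
    except FileNotFoundError:
        return None
```

Calling `read_cpu_shares("<cgroup path of the container>")` on the node returns the configured share weight, or `None` when no such cgroup exists, which would already point to a cgroup layout problem rather than a cAdvisor one.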
Hello,
Here is the docker inspect output:
# docker inspect ca1dd6fe1121
[
{
"Id": "ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53",
"Created": "2020-05-06T14:48:49.748425978Z",
"Path": "/usr/bin/cadvisor",
"Args": [
"-logtostderr"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 28173,
"ExitCode": 0,
"Error": "",
"StartedAt": "2020-05-06T14:48:50.196663407Z",
"FinishedAt": "0001-01-01T00:00:00Z",
"Health": {
"Status": "healthy",
"FailingStreak": 0,
"Log": [
{
"Start": "2020-05-10T23:57:02.794114363-07:00",
"End": "2020-05-10T23:57:02.88937386-07:00",
"ExitCode": 0,
"Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 2 100 2 0 0 2000 0 --:--:-- --:--:-- --:--:-- 2000\nok"
},
{
"Start": "2020-05-10T23:57:32.891822884-07:00",
"End": "2020-05-10T23:57:32.993163137-07:00",
"ExitCode": 0,
"Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 2 100 2 0 0 2000 0 --:--:-- --:--:-- --:--:-- 2000\nok"
},
{
"Start": "2020-05-10T23:58:02.995539059-07:00",
"End": "2020-05-10T23:58:03.095436853-07:00",
"ExitCode": 0,
"Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 2 100 2 0 0 2000 0 --:--:-- --:--:-- --:--:-- 2000\nok"
},
{
"Start": "2020-05-10T23:58:33.097826048-07:00",
"End": "2020-05-10T23:58:33.197799232-07:00",
"ExitCode": 0,
"Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 2 100 2 0 0 2000 0 --:--:-- --:--:-- --:--:-- 2000\nok"
},
{
"Start": "2020-05-10T23:59:03.199713141-07:00",
"End": "2020-05-10T23:59:03.290209724-07:00",
"ExitCode": 0,
"Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 2 100 2 0 0 2000 0 --:--:-- --:--:-- --:--:-- 2000\nok"
}
]
}
},
"Image": "sha256:752d61707eac173cfe56a23aa9de051597444286163667d60f8e6d4c63306472",
"ResolvConfPath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/resolv.conf",
"HostnamePath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/hostname",
"HostsPath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/hosts",
"LogPath": "/var/lib/docker/containers/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53-json.log",
"Name": "/mesos-f8969e4d-ec8f-49ca-99d7-460ed0b10045",
"RestartCount": 0,
"Driver": "overlay2",
"Platform": "linux",
"MountLabel": "",
"ProcessLabel": "",
"AppArmorProfile": "",
"ExecIDs": null,
"HostConfig": {
"Binds": [
"/var/run:/var/run:rw",
"/var/lib/docker:/var/lib/docker:ro",
"/dev/disk:/dev/disk:ro",
"/sys/fs/cgroup/cpu,cpuacct:/sys/fs/cgroup/cpuacct,cpu:ro",
"/sys/fs/cgroup/memory:/sys/fs/cgroup/memory:ro",
"/var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045:/mnt/mesos/sandbox",
"/:/rootfs:ro"
],
"ContainerIDFile": "",
"LogConfig": {
"Type": "json-file",
"Config": {}
},
"NetworkMode": "bridge",
"PortBindings": {
"8080/tcp": [
{
"HostIp": "",
"HostPort": "7070"
}
]
},
"RestartPolicy": {
"Name": "no",
"MaximumRetryCount": 0
},
"AutoRemove": false,
"VolumeDriver": "",
"VolumesFrom": null,
"CapAdd": null,
"CapDrop": null,
"Capabilities": null,
"Dns": [],
"DnsOptions": [],
"DnsSearch": [],
"ExtraHosts": null,
"GroupAdd": null,
"IpcMode": "private",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 0,
"PidMode": "",
"Privileged": true,
"PublishAllPorts": false,
"ReadonlyRootfs": false,
"SecurityOpt": [
"label=disable"
],
"UTSMode": "",
"UsernsMode": "",
"ShmSize": 67108864,
"Runtime": "runc",
"ConsoleSize": [
0,
0
],
"Isolation": "",
"CpuShares": 512,
"Memory": 536870912,
"NanoCpus": 0,
"CgroupParent": "",
"BlkioWeight": 0,
"BlkioWeightDevice": [],
"BlkioDeviceReadBps": null,
"BlkioDeviceWriteBps": null,
"BlkioDeviceReadIOps": null,
"BlkioDeviceWriteIOps": null,
"CpuPeriod": 0,
"CpuQuota": 0,
"CpuRealtimePeriod": 0,
"CpuRealtimeRuntime": 0,
"CpusetCpus": "",
"CpusetMems": "",
"Devices": [],
"DeviceCgroupRules": null,
"DeviceRequests": null,
"KernelMemory": 0,
"KernelMemoryTCP": 0,
"MemoryReservation": 0,
"MemorySwap": 1073741824,
"MemorySwappiness": null,
"OomKillDisable": false,
"PidsLimit": null,
"Ulimits": null,
"CpuCount": 0,
"CpuPercent": 0,
"IOMaximumIOps": 0,
"IOMaximumBandwidth": 0,
"MaskedPaths": null,
"ReadonlyPaths": null
},
"GraphDriver": {
"Data": {
"LowerDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e-init/diff:/var/lib/docker/overlay2/6ea1a261d27ebae3b05b57d7aab941e161ce7f4b670bdcc0b1f105833de66bf3/diff:/var/lib/docker/overlay2/38bf623385ba11ebbd19abb3f8f48dba437be24a19769fdae3ff035e0051969b/diff:/var/lib/docker/overlay2/5536556e53ecdfbfd88e1e7ed98db4eb5cae4dc511bf4b60144de94acea39dd7/diff",
"MergedDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e/merged",
"UpperDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e/diff",
"WorkDir": "/var/lib/docker/overlay2/1255ae103bd91751163bd4d05e03085ccac3ad6999a8f9572ffef656efb2db1e/work"
},
"Name": "overlay2"
},
"Mounts": [
{
"Type": "bind",
"Source": "/",
"Destination": "/rootfs",
"Mode": "ro",
"RW": false,
"Propagation": "rslave"
},
{
"Type": "bind",
"Source": "/var/run",
"Destination": "/var/run",
"Mode": "rw",
"RW": true,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/var/lib/docker",
"Destination": "/var/lib/docker",
"Mode": "ro",
"RW": false,
"Propagation": "rslave"
},
{
"Type": "bind",
"Source": "/dev/disk",
"Destination": "/dev/disk",
"Mode": "ro",
"RW": false,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/sys/fs/cgroup/cpu,cpuacct",
"Destination": "/sys/fs/cgroup/cpuacct,cpu",
"Mode": "ro",
"RW": false,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/sys/fs/cgroup/memory",
"Destination": "/sys/fs/cgroup/memory",
"Mode": "ro",
"RW": false,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045",
"Destination": "/mnt/mesos/sandbox",
"Mode": "",
"RW": true,
"Propagation": "rprivate"
}
],
"Config": {
"Hostname": "ca1dd6fe1121",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": true,
"AttachStderr": true,
"ExposedPorts": {
"8080/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"MARATHON_APP_DOCKER_IMAGE=docker-registry.marathon.rsshpc1prd.sc1.roche.com:5000/cadvisor:v0.33.0",
"MARATHON_APP_ID=/cadvisor",
"MARATHON_APP_RESOURCE_DISK=0.0",
"MARATHON_APP_RESOURCE_GPUS=0",
"MARATHON_APP_LABELS=",
"MESOS_CONTAINER_NAME=mesos-f8969e4d-ec8f-49ca-99d7-460ed0b10045",
"MESOS_SANDBOX=/mnt/mesos/sandbox",
"MESOS_TASK_ID=cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3",
"PORT=7070",
"PORT0=7070",
"PORTS=7070",
"HOST=lb023mesos.eth.rsshpc1.sc1.science.roche.com",
"MARATHON_APP_RESOURCE_CPUS=0.5",
"MARATHON_APP_RESOURCE_MEM=512.0",
"MARATHON_APP_VERSION=2020-05-06T14:48:46.353Z",
"PORT_8080=7070",
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"GLIBC_VERSION=2.28-r0"
],
"Cmd": null,
"Healthcheck": {
"Test": [
"CMD-SHELL",
"curl -f http://localhost:8080/healthz || exit 1"
],
"Interval": 30000000000,
"Timeout": 3000000000
},
"Image": "docker-registry.marathon.rsshpc1prd.sc1.roche.com:5000/cadvisor:v0.33.0",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": [
"/usr/bin/cadvisor",
"-logtostderr"
],
"OnBuild": null,
"Labels": {
"MESOS_TASK_ID": "cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3"
}
},
"NetworkSettings": {
"Bridge": "",
"SandboxID": "fa0b73f43783d3a0f96aa8737eb0904181fd26dfd905e751870465da3401d243",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"Ports": {
"8080/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "7070"
}
]
},
"SandboxKey": "/var/run/docker/netns/fa0b73f43783",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "f0c32cc7f735ed14cc3f877debe90dbb8f098448926cf9b2b070bb1a280b1ce2",
"Gateway": "172.17.0.1",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "172.17.0.2",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"MacAddress": "02:42:ac:11:00:02",
"Networks": {
"bridge": {
"IPAMConfig": null,
"Links": null,
"Aliases": null,
"NetworkID": "a833454843d54d3751190ebcbaa78e8dab4ad731756390b5fb781bc1d8430ae2",
"EndpointID": "f0c32cc7f735ed14cc3f877debe90dbb8f098448926cf9b2b070bb1a280b1ce2",
"Gateway": "172.17.0.1",
"IPAddress": "172.17.0.2",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"MacAddress": "02:42:ac:11:00:02",
"DriverOpts": null
}
}
}
}
]
The cpu.shares file exists and it contains the correct number:
# pwd
/sys/fs/cgroup/cpu/docker/ca1dd6fe112193c5e42f5821986b7cdd7b7472e74e761f31cb4253a7a6329f53
# cat cpu.shares
512
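As a sanity check on that number: the inspect output shows "CpuShares": 512 next to MARATHON_APP_RESOURCE_CPUS=0.5, which is consistent with 1024 shares per reserved CPU. The sketch below works through that arithmetic; the constant and function names are hypothetical, and rounding down in `cpus_to_shares` is an assumption.

```python
# Assumed convention: 1024 cpu.shares correspond to one reserved CPU.
SHARES_PER_CPU = 1024


def shares_to_cpus(shares: int) -> float:
    """Convert a cgroup v1 cpu.shares value to the equivalent CPU count."""
    return shares / SHARES_PER_CPU


def cpus_to_shares(cpus: float) -> int:
    """Convert a CPU reservation to cpu.shares (rounding down is an assumption)."""
    return int(cpus * SHARES_PER_CPU)


print(shares_to_cpus(512))  # 0.5
print(cpus_to_shares(0.5))  # 512
```

So the value on disk matches the Marathon reservation, and the missing metric is not explained by a wrong cgroup setting.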
Regarding the logs, this is the content of /var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045/stderr
I0506 07:48:49.661885 28050 exec.cpp:162] Version: 1.7.0
I0506 07:48:49.666883 28070 exec.cpp:236] Executor registered on agent 4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7
I0506 07:48:49.668123 28074 executor.cpp:130] Registered docker executor on lb023mesos.eth.rsshpc1.sc1.science.roche.com
I0506 07:48:49.668407 28075 executor.cpp:186] Starting task cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3
While /var/lib/mesos/slaves/4d7b0ae6-e93a-4c78-a3cf-762a52f188ac-S7/frameworks/0ae74862-90d9-42f5-a5bd-0560521c1914-0000/executors/cadvisor.b05e50ec-8fa8-11ea-a280-005056ae54b3/runs/f8969e4d-ec8f-49ca-99d7-460ed0b10045/stdout is empty.
Also, I forgot to say that we are using cAdvisor version 0.33, downloaded from Docker Hub a year ago. I've seen that the original repository is deprecated, but I can't find the new Docker images on Docker Hub.
Thanks!
New images are only hosted in gcr.io. See the readme.
If you scrape the cAdvisor endpoint, do you see any metrics with the container id listed above? I'm trying to figure out if you are just missing some metadata from the metric, or if the metric isn't present at all
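One way to answer that question is to scrape the endpoint and group the samples that mention the container ID by metric name. This is a rough sketch, not cAdvisor tooling: the helper names are hypothetical, and it matches the ID as a plain substring so it does not depend on any particular label name.

```python
import urllib.request
from typing import Dict, List


def metrics_for_container(exposition_text: str, container_id: str) -> Dict[str, List[str]]:
    """Group sample lines that mention container_id by metric name."""
    found: Dict[str, List[str]] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if container_id not in line:
            continue
        # The metric name ends at the first "{" (labels) or space (no labels).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        found.setdefault(name, []).append(line)
    return found


def fetch_metrics(url: str) -> str:
    """Download the Prometheus text exposition from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

For example, assuming the endpoint is reachable at http://localhost:7070/metrics (7070 is the host port in the inspect output; the host name is a guess), `"container_spec_cpu_shares" in metrics_for_container(fetch_metrics(url), container_id)` distinguishes a metric that is absent for this container from one that is present but missing labels.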
This is what it looks like. There is no CPU info.
I compared with another cluster, which hasn't been upgraded, and there the CPU shares do appear:
It's strange that you don't have any logs for cAdvisor...
Do you know what version you were using before the upgrade? Can you try using the latest version (gcr.io/google_containers/cadvisor:v0.36.0), and see if the problem still happens?
Hello,
Sorry, cadvisor version was always 0.33. What we upgraded was Docker from 1.13.0 to 19.03.6; sorry if that was not clear. In any case, I will try to upgrade to 0.36 and come back to you. Thanks!
Hello,
I upgraded to version 0.36, but unfortunately, the issue persists.
I haven't seen this bug before. Here is where the shares should be collected: https://github.com/google/cadvisor/blob/6a8d61401ea994338e41b013fb353ded17f87269/container/common/helpers.go#L100
If you can get logs from cAdvisor, that would help diagnose the issue. It may require increasing the logging verbosity to --v=4 to get more details.
Here is the log output after increasing the verbosity:
Sadly, I can't see anything interesting in the logs. Just to confirm, you are missing all CPU share metrics, right?
Yes, that is the problem.
Our Docker version, 18.09.6, still works on CentOS 7.6.1810. The issue here is only with the UCR containers; I am going to check whether there is already a report for that, since it looks like a different issue.
Hello, we are using cAdvisor to monitor the memory and CPU usage of containers. It was working well with Docker 1.13.0 and CentOS 7.4. However, we recently upgraded to Docker 19.03.6 and CentOS 7.7, and we can no longer get the reserved CPU per container (we run the containers through Mesos-Marathon).
Before, we could use this metric in Grafana:
container_spec_cpu_shares{container_label_MESOS_TASK_ID!=""}/1024
and this would give us the CPUs reserved per container in Mesos, which was quite useful for us. Right now, it is not returning any value. Is this a compatibility issue with Docker that is fixed in a newer cAdvisor version? Or do we perhaps have to tweak something in Docker?
Thanks in advance!