google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Can't collect mesos container metrics about GPU if mesos container starts after cadvisor starts #2175

Open tongbinxiang opened 5 years ago

tongbinxiang commented 5 years ago

GPU metrics for mesos containers are available if the containers start before cAdvisor starts, but not if they start after cAdvisor starts. To get GPU metrics for those containers, I have to restart cAdvisor.

dashpole commented 5 years ago

cc @sashankreddya, who has done most of the mesos integration.

Can you share the cadvisor log from when it starts up and doesn't get GPU metrics and when you restart it and it does?

There are a couple of potential reasons this could be happening that I can think of:

  1. You are attaching a GPU to the machine/VM after starting cAdvisor. cAdvisor doesn't initialize NVML unless there are GPUs present at startup.
  2. Something strange with the mesos integration.

We should try and rule out (1) first. For (2), I wonder if mesos does some late initialization of GPU devices or something? The only reason cAdvisor wouldn't monitor the GPU is if it wasn't present when the container is initially discovered by cAdvisor.

tongbinxiang commented 5 years ago

cadvisor log is as follows: I0219 09:20:37.933860 1 storagedriver.go:50] Caching stats in memory for 2m0s I0219 09:20:37.934186 1 manager.go:151] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct" I0219 09:20:38.018088 1 fs.go:139] Filesystem UUIDs: map[79632962-af28-491d-ad41-8147a6c26188:/dev/sda1 a0de5d03-6714-41e9-abae-fc1219131e26:/dev/sdb1 f09b9eec-fc73-48a4-bd46-08c5f71543c2:/dev/sda2] I0219 09:20:38.018134 1 fs.go:140] Filesystem partitions: map[/dev/sdb1:{mountpoint:/rootfs/home major:8 minor:17 fsType:ext4 blockSize:0} shm:{mountpoint:/rootfs/var/lib/docker/containers/245c6964608dada9c65d6ba95d293419e43e13a4d99332de0e228dae7ae0916d/mounts/shm major:0 minor:56 fsType:tmpfs blockSize:0} tmpfs:{mountpoint:/dev major:0 minor:59 fsType:tmpfs blockSize:0} /dev/sda1:{mountpoint:/var/lib/docker major:8 minor:1 fsType:ext4 blockSize:0}] I0219 09:20:38.025345 1 manager.go:225] Machine: {NumCores:40 CpuFrequency:3400000 MemoryCapacity:135081676800 HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] MachineID:398a0844cc22fd0844eb7106582008b0 SystemUUID:4C4C4544-005A-3910-8031-B4C04F4D4732 BootID:26cbef0a-5a1f-4b75-9bfe-9851ccd47692 Filesystems:[{Device:/dev/sdb1 DeviceMajor:8 DeviceMinor:17 Capacity:787474333696 Type:vfs Inodes:48840704 HasInodes:true} {Device:shm DeviceMajor:0 DeviceMinor:56 Capacity:67108864 Type:vfs Inodes:16489462 HasInodes:true} {Device:none DeviceMajor:0 DeviceMinor:55 Capacity:779633565696 Type:vfs Inodes:48357376 HasInodes:true} {Device:tmpfs DeviceMajor:0 DeviceMinor:59 Capacity:67108864 Type:vfs Inodes:16489462 HasInodes:true} {Device:/dev/sda1 DeviceMajor:8 DeviceMinor:1 Capacity:779633565696 Type:vfs Inodes:48357376 HasInodes:true}] DiskMap:map[8:0:{Name:sda Major:8 Minor:0 Size:800166076416 Scheduler:deadline} 8:16:{Name:sdb Major:8 Minor:16 Size:800166076416 Scheduler:deadline}] NetworkDevices:[{Name:eno1 MacAddress:f4:8e:38:ce:eb:64 Speed:1000 Mtu:1500} {Name:eno2 MacAddress:f4:8e:38:ce:eb:66 Speed:-1 
Mtu:1500} {Name:enp20s0f0 MacAddress:a0:36:9f:c9:ad:ac Speed:10000 Mtu:1500} {Name:enp20s0f1 MacAddress:a0:36:9f:c9:ad:ae Speed:-1 Mtu:1500} {Name:mesos168088 MacAddress:f4:8e:38:ce:eb:64 Speed:10000 Mtu:1500} {Name:mesos61637 MacAddress:f4:8e:38:ce:eb:64 Speed:10000 Mtu:1500}] Topology:[{Id:0 Memory:67450953728 Cores:[{Id:0 Threads:[0 20] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:1 Threads:[2 22] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:2 Threads:[4 24] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:3 Threads:[6 26] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:4 Threads:[8 28] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:8 Threads:[10 30] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:9 Threads:[12 32] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:10 Threads:[14 34] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:11 Threads:[16 36] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:12 Threads:[18 38] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:1 Memory:67630723072 Cores:[{Id:0 Threads:[1 21] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:1 Threads:[3 23] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} 
{Size:262144 Type:Unified Level:2}]} {Id:2 Threads:[5 25] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:3 Threads:[7 27] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:4 Threads:[9 29] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:8 Threads:[11 31] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:9 Threads:[13 33] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:10 Threads:[15 35] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:11 Threads:[17 37] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:12 Threads:[19 39] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None} I0219 09:20:38.026413 1 manager.go:231] Version: {KernelVersion:4.4.0-141-generic ContainerOsVersion:Alpine Linux v3.4 DockerVersion:18.03.1-ce DockerAPIVersion:1.37 CadvisorVersion:v0.28.3 CadvisorRevision:1e567c2} I0219 09:20:38.055068 1 factory.go:356] Registering Docker factory I0219 09:20:40.055562 1 factory.go:54] Registering systemd factory I0219 09:20:40.057960 1 factory.go:86] Registering Raw factory I0219 09:20:40.059964 1 manager.go:1178] Started watching for new ooms in manager I0219 09:20:40.067562 1 nvidia.go:110] NVML initialized. 
Number of nvidia devices: 8 I0219 09:20:40.127912 1 manager.go:329] Starting recovery of all containers I0219 09:20:40.290675 1 manager.go:334] Recovery completed I0219 09:20:40.483756 1 cadvisor.go:162] Starting cAdvisor version: v0.28.3-1e567c2 on port 8080

tongbinxiang commented 5 years ago

After cAdvisor restarts, the cAdvisor log is as follows:

I0219 09:20:37.933860 1 storagedriver.go:50] Caching stats in memory for 2m0s I0219 09:20:37.934186 1 manager.go:151] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct" I0219 09:20:38.018088 1 fs.go:139] Filesystem UUIDs: map[79632962-af28-491d-ad41-8147a6c26188:/dev/sda1 a0de5d03-6714-41e9-abae-fc1219131e26:/dev/sdb1 f09b9eec-fc73-48a4-bd46-08c5f71543c2:/dev/sda2] I0219 09:20:38.018134 1 fs.go:140] Filesystem partitions: map[/dev/sdb1:{mountpoint:/rootfs/home major:8 minor:17 fsType:ext4 blockSize:0} shm:{mountpoint:/rootfs/var/lib/docker/containers/245c6964608dada9c65d6ba95d293419e43e13a4d99332de0e228dae7ae0916d/mounts/shm major:0 minor:56 fsType:tmpfs blockSize:0} tmpfs:{mountpoint:/dev major:0 minor:59 fsType:tmpfs blockSize:0} /dev/sda1:{mountpoint:/var/lib/docker major:8 minor:1 fsType:ext4 blockSize:0}] I0219 09:20:38.025345 1 manager.go:225] Machine: {NumCores:40 CpuFrequency:3400000 MemoryCapacity:135081676800 HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] MachineID:398a0844cc22fd0844eb7106582008b0 SystemUUID:4C4C4544-005A-3910-8031-B4C04F4D4732 BootID:26cbef0a-5a1f-4b75-9bfe-9851ccd47692 Filesystems:[{Device:/dev/sdb1 DeviceMajor:8 DeviceMinor:17 Capacity:787474333696 Type:vfs Inodes:48840704 HasInodes:true} {Device:shm DeviceMajor:0 DeviceMinor:56 Capacity:67108864 Type:vfs Inodes:16489462 HasInodes:true} {Device:none DeviceMajor:0 DeviceMinor:55 Capacity:779633565696 Type:vfs Inodes:48357376 HasInodes:true} {Device:tmpfs DeviceMajor:0 DeviceMinor:59 Capacity:67108864 Type:vfs Inodes:16489462 HasInodes:true} {Device:/dev/sda1 DeviceMajor:8 DeviceMinor:1 Capacity:779633565696 Type:vfs Inodes:48357376 HasInodes:true}] DiskMap:map[8:0:{Name:sda Major:8 Minor:0 Size:800166076416 Scheduler:deadline} 8:16:{Name:sdb Major:8 Minor:16 Size:800166076416 Scheduler:deadline}] NetworkDevices:[{Name:eno1 MacAddress:f4:8e:38:ce:eb:64 Speed:1000 Mtu:1500} {Name:eno2 MacAddress:f4:8e:38:ce:eb:66 Speed:-1 Mtu:1500} {Name:enp20s0f0 
MacAddress:a0:36:9f:c9:ad:ac Speed:10000 Mtu:1500} {Name:enp20s0f1 MacAddress:a0:36:9f:c9:ad:ae Speed:-1 Mtu:1500} {Name:mesos168088 MacAddress:f4:8e:38:ce:eb:64 Speed:10000 Mtu:1500} {Name:mesos61637 MacAddress:f4:8e:38:ce:eb:64 Speed:10000 Mtu:1500}] Topology:[{Id:0 Memory:67450953728 Cores:[{Id:0 Threads:[0 20] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:1 Threads:[2 22] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:2 Threads:[4 24] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:3 Threads:[6 26] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:4 Threads:[8 28] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:8 Threads:[10 30] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:9 Threads:[12 32] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:10 Threads:[14 34] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:11 Threads:[16 36] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:12 Threads:[18 38] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:1 Memory:67630723072 Cores:[{Id:0 Threads:[1 21] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:1 Threads:[3 23] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified 
Level:2}]} {Id:2 Threads:[5 25] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:3 Threads:[7 27] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:4 Threads:[9 29] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:8 Threads:[11 31] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:9 Threads:[13 33] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:10 Threads:[15 35] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:11 Threads:[17 37] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:12 Threads:[19 39] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None} I0219 09:20:38.026413 1 manager.go:231] Version: {KernelVersion:4.4.0-141-generic ContainerOsVersion:Alpine Linux v3.4 DockerVersion:18.03.1-ce DockerAPIVersion:1.37 CadvisorVersion:v0.28.3 CadvisorRevision:1e567c2} I0219 09:20:38.055068 1 factory.go:356] Registering Docker factory I0219 09:20:40.055562 1 factory.go:54] Registering systemd factory I0219 09:20:40.057960 1 factory.go:86] Registering Raw factory I0219 09:20:40.059964 1 manager.go:1178] Started watching for new ooms in manager I0219 09:20:40.067562 1 nvidia.go:110] NVML initialized. 
Number of nvidia devices: 8 I0219 09:20:40.127912 1 manager.go:329] Starting recovery of all containers I0219 09:20:40.290675 1 manager.go:334] Recovery completed I0219 09:20:40.483756 1 cadvisor.go:162] Starting cAdvisor version: v0.28.3-1e567c2 on port 8080 I0220 00:47:41.172113 1 manager.go:1168] Exiting thread watching subcontainers I0220 00:47:41.172156 1 manager.go:396] Exiting global housekeeping thread I0220 00:47:41.177712 1 cadvisor.go:196] Exiting given signal: terminated I0220 00:47:41.934458 1 storagedriver.go:50] Caching stats in memory for 2m0s I0220 00:47:41.934792 1 manager.go:151] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct" I0220 00:47:42.033740 1 fs.go:139] Filesystem UUIDs: map[f09b9eec-fc73-48a4-bd46-08c5f71543c2:/dev/sda2 79632962-af28-491d-ad41-8147a6c26188:/dev/sda1 a0de5d03-6714-41e9-abae-fc1219131e26:/dev/sdb1] I0220 00:47:42.033778 1 fs.go:140] Filesystem partitions: map[/dev/sdb1:{mountpoint:/rootfs/home major:8 minor:17 fsType:ext4 blockSize:0} shm:{mountpoint:/rootfs/var/lib/docker/containers/245c6964608dada9c65d6ba95d293419e43e13a4d99332de0e228dae7ae0916d/mounts/shm major:0 minor:56 fsType:tmpfs blockSize:0} tmpfs:{mountpoint:/dev major:0 minor:59 fsType:tmpfs blockSize:0} /dev/sda1:{mountpoint:/var/lib/docker major:8 minor:1 fsType:ext4 blockSize:0}] I0220 00:47:42.041158 1 manager.go:225] Machine: {NumCores:40 CpuFrequency:3400000 MemoryCapacity:135081676800 HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] MachineID:398a0844cc22fd0844eb7106582008b0 SystemUUID:4C4C4544-005A-3910-8031-B4C04F4D4732 BootID:26cbef0a-5a1f-4b75-9bfe-9851ccd47692 Filesystems:[{Device:/dev/sda1 DeviceMajor:8 DeviceMinor:1 Capacity:779633565696 Type:vfs Inodes:48357376 HasInodes:true} {Device:/dev/sdb1 DeviceMajor:8 DeviceMinor:17 Capacity:787474333696 Type:vfs Inodes:48840704 HasInodes:true} {Device:shm DeviceMajor:0 DeviceMinor:56 Capacity:67108864 Type:vfs Inodes:16489462 HasInodes:true} {Device:none DeviceMajor:0 
DeviceMinor:55 Capacity:779633565696 Type:vfs Inodes:48357376 HasInodes:true} {Device:tmpfs DeviceMajor:0 DeviceMinor:59 Capacity:67108864 Type:vfs Inodes:16489462 HasInodes:true}] DiskMap:map[8:0:{Name:sda Major:8 Minor:0 Size:800166076416 Scheduler:deadline} 8:16:{Name:sdb Major:8 Minor:16 Size:800166076416 Scheduler:deadline}] NetworkDevices:[{Name:eno1 MacAddress:f4:8e:38:ce:eb:64 Speed:1000 Mtu:1500} {Name:eno2 MacAddress:f4:8e:38:ce:eb:66 Speed:-1 Mtu:1500} {Name:enp20s0f0 MacAddress:a0:36:9f:c9:ad:ac Speed:10000 Mtu:1500} {Name:enp20s0f1 MacAddress:a0:36:9f:c9:ad:ae Speed:-1 Mtu:1500} {Name:mesos168088 MacAddress:f4:8e:38:ce:eb:64 Speed:10000 Mtu:1500} {Name:mesos61637 MacAddress:f4:8e:38:ce:eb:64 Speed:10000 Mtu:1500} {Name:mesos6676 MacAddress:f4:8e:38:ce:eb:64 Speed:10000 Mtu:1500}] Topology:[{Id:0 Memory:67450953728 Cores:[{Id:0 Threads:[0 20] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:1 Threads:[2 22] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:2 Threads:[4 24] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:3 Threads:[6 26] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:4 Threads:[8 28] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:8 Threads:[10 30] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:9 Threads:[12 32] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:10 Threads:[14 34] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:11 Threads:[16 36] Caches:[{Size:32768 Type:Data Level:1} 
{Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:12 Threads:[18 38] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:1 Memory:67630723072 Cores:[{Id:0 Threads:[1 21] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:1 Threads:[3 23] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:2 Threads:[5 25] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:3 Threads:[7 27] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:4 Threads:[9 29] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:8 Threads:[11 31] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:9 Threads:[13 33] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:10 Threads:[15 35] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:11 Threads:[17 37] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]} {Id:12 Threads:[19 39] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None} I0220 00:47:42.042417 1 manager.go:231] Version: {KernelVersion:4.4.0-141-generic ContainerOsVersion:Alpine Linux v3.4 DockerVersion:18.03.1-ce DockerAPIVersion:1.37 CadvisorVersion:v0.28.3 CadvisorRevision:1e567c2} I0220 00:47:42.085252 1 
factory.go:356] Registering Docker factory I0220 00:47:44.085677 1 factory.go:54] Registering systemd factory I0220 00:47:44.088278 1 factory.go:86] Registering Raw factory I0220 00:47:44.090789 1 manager.go:1178] Started watching for new ooms in manager I0220 00:47:44.098410 1 nvidia.go:110] NVML initialized. Number of nvidia devices: 8 I0220 00:47:44.159904 1 manager.go:329] Starting recovery of all containers I0220 00:47:44.317242 1 manager.go:334] Recovery completed I0220 00:47:44.500761 1 cadvisor.go:162] Starting cAdvisor version: v0.28.3-1e567c2 on port 8080

dashpole commented 5 years ago

Thanks for the log! So it looks like NVML initializes all 8 devices each time. This is probably a quirk of how mesos uses cgroups.

tongbinxiang commented 5 years ago

Thanks for your answer. So is this a problem with mesos itself?

dashpole commented 5 years ago

No. Just that cAdvisor makes lots of assumptions about what a "container" is, and how it behaves. If the mesos containerizer doesn't follow those assumptions, then we can't monitor it.

Can you try running cAdvisor with --v 4 to see if we are getting this log: https://github.com/google/cadvisor/blob/master/manager/manager.go#L1024

If that doesn't show anything, you would probably have to add logging in https://github.com/google/cadvisor/blob/master/accelerators/nvidia.go#L163 to confirm that the nvidia device in question isn't in the devices cgroup when the container is started.

tongbinxiang commented 5 years ago

Sorry to bother you. I ran cAdvisor with "--v 4" and it doesn't show the log from https://github.com/google/cadvisor/blob/master/manager/manager.go#L1024

https://github.com/google/cadvisor/blob/master/accelerators/nvidia.go#L163 reads devices.list, so I looked at the mesos container's cgroup file devices.list in the /sys/fs/cgroup/devices/mesos/2cee36cb-6374-49b4-bcc7-666bbf5f0174 directory. Its contents are as follows:

    c : m
    b : m
    c 5:1 rwm
    c 4:0 rwm
    c 4:1 rwm
    c 136:* rwm
    c 5:2 rwm
    c 10:200 rwm
    c 1:3 rwm
    c 1:5 rwm
    c 1:7 rwm
    c 5:0 rwm
    c 1:9 rwm
    c 1:8 rwm
    c 243:0 rwm
    c 195:255 rwm
    c 195:5 rwm

The last line, c 195:5 rwm, has nvidiaMinorNumber 5.

The mesos container [2cee36cb-6374-49b4-bcc7-666bbf5f0174] started after cAdvisor started up, and cAdvisor can't get this container's GPU UUID or the related metrics.
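For anyone checking a container by hand, here is a minimal sketch of pulling the GPU minor numbers out of a devices.list dump like the one above. It assumes NVIDIA GPUs use character-device major 195 and that minor 255 is the nvidiactl control node rather than a GPU. The function name `nvidia_minors` is made up for illustration and this is not cAdvisor's own parsing code.

```python
NVIDIA_MAJOR = 195        # assumed: Linux character-device major for NVIDIA GPUs
NVIDIA_CTL_MINOR = 255    # assumed: /dev/nvidiactl, a control node, not a GPU


def nvidia_minors(devices_list_text):
    """Return the sorted GPU minor numbers allowed by a devices.list dump."""
    minors = set()
    for line in devices_list_text.splitlines():
        fields = line.split()
        # Expect entries shaped like "c 195:5 rwm"; skip anything else.
        if len(fields) != 3 or fields[0] != "c":
            continue
        major, sep, minor = fields[1].partition(":")
        # Wildcard minors ("195:*") are skipped because they name no single GPU.
        if sep != ":" or major != str(NVIDIA_MAJOR) or not minor.isdigit():
            continue
        if int(minor) != NVIDIA_CTL_MINOR:
            minors.add(int(minor))
    return sorted(minors)


# A few of the entries quoted in this thread.
sample = """c 5:1 rwm
c 136:* rwm
c 243:0 rwm
c 195:255 rwm
c 195:5 rwm
"""
print(nvidia_minors(sample))  # [5]
```

An empty result for a container that should have a GPU would mean the device entry was not in the devices cgroup at the moment the file was read.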

dashpole commented 5 years ago

Right. I'm sure the device is there eventually, but I'm wondering if there is a race condition happening, i.e. Mesos creates the cgroup, then cAdvisor reads from the devices cgroup, and only then is the GPU added. If you can build cAdvisor with an extra log line that prints out the contents of the cgroup, we could confirm that.
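Short of rebuilding cAdvisor, the suspected ordering could also be observed from outside. The sketch below polls a file and records its contents every time they change, together with the elapsed time, so a late-arriving GPU entry shows up as a second snapshot. The helper name `snapshot_changes` and its parameters are hypothetical, and polling can miss changes that happen faster than the interval.

```python
import time
from pathlib import Path


def snapshot_changes(path, duration=5.0, interval=0.05):
    """Poll `path` and return [(elapsed_seconds, contents), ...] for each change.

    `contents` is None while the file cannot be read (for example, before
    the cgroup directory exists).
    """
    path = Path(path)
    start = time.monotonic()
    history = []
    last = object()  # sentinel that never compares equal to real contents
    while True:
        elapsed = time.monotonic() - start
        try:
            current = path.read_text()
        except OSError:
            current = None
        if current != last:
            history.append((round(elapsed, 3), current))
            last = current
        if elapsed >= duration:
            return history
        time.sleep(interval)
```

Started just before the mesos task is launched and pointed at the container's devices.list under /sys/fs/cgroup/devices/mesos/, more than one snapshot, with the c 195:N line only in a later one, would support the race theory.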

tongbinxiang commented 5 years ago

Thanks. Do you mean that every time cAdvisor reads from the devices cgroup, the GPU is only added afterwards? If the GPU is only late the first time, cAdvisor should pick up the GPU metrics on the next read. I will add extra log lines to confirm that.

sashankreddya commented 5 years ago

@tongbinxiang: is the mesos handler registered? I don't see a "Registering mesos factory" message in the logs.

tongbinxiang commented 5 years ago

@sashankreddya I don't see the message "Registering mesos factory" in the log. Is my cadvisor version too old?

tongbinxiang commented 5 years ago

Hello @sashankreddya. I pulled the latest image and still don't see the "Registering mesos factory" log.

sashankreddya commented 5 years ago

@tongbinxiang : I am able to see the message when I start with the latest clone of cadvisor with "v=4".

A couple of things to make sure:

  1. Check that the mesosAgentAdress is correct, as in here. The default is as present here.
  2. Check that you are not seeing any of these error conditions.

Lapayo commented 3 years ago

I think I might have a similar issue with current mesos and cadvisor. It only shows GPUs for containers that were started before cadvisor. In the log I can see that it is unable to find the devices.list. So it seems like cadvisor is checking before that file is present.

Is there any way to fix or work around this? Maybe by having cadvisor wait a few seconds before picking up a new container?
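To make the "wait a bit before giving up" idea concrete, here is a rough sketch of a bounded retry: keep re-reading devices.list until a GPU entry shows up or a deadline passes. This is not an existing cAdvisor option. `wait_for_gpu_minors`, `gpu_minors`, and the major 195 / minor 255 convention are assumptions for illustration, and the reader is passed in as a callable so the example runs without a real cgroup.

```python
import time


def gpu_minors(text):
    """Minor numbers of "c 195:N rwm" entries, ignoring 195:255 (nvidiactl)."""
    minors = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "c":
            major, _, minor = fields[1].partition(":")
            if major == "195" and minor.isdigit() and minor != "255":
                minors.append(int(minor))
    return sorted(minors)


def wait_for_gpu_minors(read_text, timeout=10.0, interval=0.5):
    """Call read_text() until it yields a GPU entry or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            minors = gpu_minors(read_text())
        except OSError:
            minors = []  # devices.list is not there yet
        if minors or time.monotonic() >= deadline:
            return minors
        time.sleep(interval)


# Simulated container start: file missing, then no GPU, then GPU added.
attempts = iter([None, "c 1:3 rwm\n", "c 1:3 rwm\nc 195:5 rwm\n"])


def fake_read():
    text = next(attempts)
    if text is None:
        raise FileNotFoundError("devices.list not created yet")
    return text


print(wait_for_gpu_minors(fake_read, timeout=5.0, interval=0.01))  # [5]
```

The timeout matters because most containers never get a GPU, so an unbounded wait would stall discovery for all of them.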