treydock / cgroup_exporter

Crash attempting to get slurm cgroups #15

Closed. groucho64738 closed this issue 3 years ago.

groucho64738 commented 3 years ago

Running on RH7: ./cgroup_exporter --config.paths=/slurm. Once I attempt to view the metrics, the application crashes:

level=info ts=2021-06-08T11:12:06.263Z caller=cgroup_exporter.go:493 msg="Starting cgroup_exporter" version="(version=0.7.0, branch=master, revision=4678727df26fee7418ea11a8ad8228c688c7f87d)"
level=info ts=2021-06-08T11:12:06.263Z caller=cgroup_exporter.go:494 msg="Build context" build_context="(go=go1.16.5, user=root@aos-ood-graph, date=20210607-15:01:58)"
level=info ts=2021-06-08T11:12:06.263Z caller=cgroup_exporter.go:495 msg="Starting Server" address=:9306
panic: interface conversion: interface is nil, not cgroups.pather

goroutine 10 [running]:
github.com/containerd/cgroups.(*cgroup).processes(0xc000296040, 0xa8ca1b, 0x7, 0xc000181701, 0xc000294090, 0xc0002a2000, 0xc000298040, 0x1, 0xc000290030)
        /root/go/pkg/mod/github.com/containerd/cgroups@v1.0.1/cgroup.go:338 +0x118
github.com/containerd/cgroups.(*cgroup).Processes(0xc000296040, 0xa8ca1b, 0x7, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/pkg/mod/github.com/containerd/cgroups@v1.0.1/cgroup.go:329 +0x105
main.(*Exporter).collect(0xc000195600, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/cgroup_exporter/cgroup_exporter.go:370 +0x6b1
main.(*Exporter).Collect(0xc000195600, 0xc0000d8e40)
        /root/cgroup_exporter/cgroup_exporter.go:435 +0x45
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /root/go/pkg/mod/github.com/prometheus/client_golang@v1.10.0/prometheus/registry.go:446 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /root/go/pkg/mod/github.com/prometheus/client_golang@v1.10.0/prometheus/registry.go:457 +0x5ce

We have slurm directories under:

/sys/fs/cgroup/cpuset/slurm
/sys/fs/cgroup/memory/slurm
/sys/fs/cgroup/devices/slurm
/sys/fs/cgroup/freezer/slurm

treydock commented 3 years ago

That appears to be a nil pointer inside the containerd library this exporter uses. The last place referenced in this exporter's code is where the cpuacct cgroup is queried. Do you have the cpuacct cgroup for SLURM?
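
For reference, here is a minimal reproducer sketch of what the stack trace shows (this is not the exporter's actual code; the /slurm path and the use of cgroups.V1 are assumptions). As far as I can tell, containerd/cgroups skips hierarchies whose path doesn't exist when loading, and Processes() then hits the nil interface conversion when asked for one of the skipped subsystems:

package main

import (
	"fmt"
	"log"

	"github.com/containerd/cgroups"
)

func main() {
	// Load the /slurm cgroup from the v1 hierarchies; hierarchies whose
	// /slurm path is missing (e.g. cpu,cpuacct in this report) appear to
	// be dropped at load time rather than reported as an error.
	control, err := cgroups.Load(cgroups.V1, cgroups.StaticPath("/slurm"))
	if err != nil {
		log.Fatalf("load /slurm cgroup: %v", err)
	}

	// This is the call that panicked in the trace above
	// (cgroup_exporter.go:370 -> cgroup.go:338) when cpuacct was absent.
	procs, err := control.Processes(cgroups.Cpuacct, true)
	if err != nil {
		log.Fatalf("list cpuacct processes: %v", err)
	}
	fmt.Printf("%d processes in the /slurm cpuacct cgroup\n", len(procs))
}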

This is what I have on one of my compute nodes:

[root@p0001 ~]# ls -la /sys/fs/cgroup/
total 0
drwxr-xr-x 13 root root 340 Mar 31 19:58 .
drwxr-xr-x  9 root root   0 Mar 31 19:58 ..
drwxr-xr-x  4 root root   0 Mar 31 19:58 blkio
lrwxrwxrwx  1 root root  11 Mar 31 19:58 cpu -> cpu,cpuacct
lrwxrwxrwx  1 root root  11 Mar 31 19:58 cpuacct -> cpu,cpuacct
drwxr-xr-x  5 root root   0 Mar 31 19:58 cpu,cpuacct
drwxr-xr-x  3 root root   0 Mar 31 19:58 cpuset
drwxr-xr-x  5 root root   0 Mar 31 19:58 devices
drwxr-xr-x  3 root root   0 Mar 31 19:58 freezer
drwxr-xr-x  2 root root   0 Mar 31 19:58 hugetlb
drwxr-xr-x  5 root root   0 Mar 31 19:58 memory
lrwxrwxrwx  1 root root  16 Mar 31 19:58 net_cls -> net_cls,net_prio
drwxr-xr-x  2 root root   0 Mar 31 19:58 net_cls,net_prio
lrwxrwxrwx  1 root root  16 Mar 31 19:58 net_prio -> net_cls,net_prio
drwxr-xr-x  2 root root   0 Mar 31 19:58 perf_event
drwxr-xr-x  4 root root   0 Mar 31 19:58 pids
drwxr-xr-x  4 root root   0 Mar 31 19:58 systemd

It does look like the exporter is using some code paths that are only needed if you pass --collect.proc, so it might be possible to avoid the issue by skipping those code paths when that flag is not passed.
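
Something along these lines might work (a rough sketch only; the flag handling and function names here are assumptions, not the exporter's real code):

package main

import (
	"flag"
	"log"

	"github.com/containerd/cgroups"
)

// Off by default, mirroring the idea of only walking process lists
// when explicitly requested.
var collectProc = flag.Bool("collect.proc", false, "collect per-process information")

func collect(path string) error {
	control, err := cgroups.Load(cgroups.V1, cgroups.StaticPath(path))
	if err != nil {
		return err
	}
	if *collectProc {
		// Only reach Processes() (the call that panicked) when process
		// collection was requested.
		procs, err := control.Processes(cgroups.Cpuacct, true)
		if err != nil {
			return err
		}
		log.Printf("found %d processes under %s", len(procs), path)
	}
	// CPU/memory stats could still be gathered here via control.Stat().
	return nil
}

func main() {
	flag.Parse()
	if err := collect("/slurm"); err != nil {
		log.Fatal(err)
	}
}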

treydock commented 3 years ago

A more useful example of the cgroups I have tested against:

[root@p0001 ~]# find /sys/fs/cgroup -type d -name slurm
/sys/fs/cgroup/freezer/slurm
/sys/fs/cgroup/cpu,cpuacct/slurm
/sys/fs/cgroup/cpuset/slurm
/sys/fs/cgroup/memory/slurm
/sys/fs/cgroup/devices/slurm

If you don't have cpu,cpuacct, I think that will cause errors. One issue I have run into is that when a compute node first boots, the SLURM cgroups don't exist until the first job runs, but that has never caused problems like this at my site: it produces errors, not crashes.

treydock commented 3 years ago

I was able to reproduce the issue by running the unit tests with fixtures/cpuacct removed, so it looks like the absence of the cpuacct cgroup is what causes this problem. I will look into it.
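
For anyone following along, this is roughly the kind of regression check it needs (a sketch, not the exporter's actual test suite; the /slurm path is an assumption and the test skips when that cgroup can't be loaded):

// Save as something like cpuacct_missing_test.go and run with `go test`.
package main

import (
	"testing"

	"github.com/containerd/cgroups"
)

// Asserts that asking for cpuacct processes on a cgroup whose cpuacct
// hierarchy is missing returns an error instead of panicking.
func TestProcessesWithoutCpuacct(t *testing.T) {
	defer func() {
		if r := recover(); r != nil {
			t.Fatalf("Processes panicked on missing cpuacct cgroup: %v", r)
		}
	}()

	control, err := cgroups.Load(cgroups.V1, cgroups.StaticPath("/slurm"))
	if err != nil {
		t.Skipf("could not load /slurm cgroup: %v", err)
	}
	if _, err := control.Processes(cgroups.Cpuacct, true); err != nil {
		t.Logf("got an error instead of a panic: %v", err)
	}
}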

treydock commented 3 years ago

After doing some testing, this exporter won't work without the cpuacct cgroup. I'm not 100% certain which SLURM or OS settings would have caused you to be missing the cpuacct cgroup. I don't do anything to modify cgroup behavior at the OS level, but for SLURM these are my relevant cgroup settings:

# grep cgroup /etc/slurm/slurm.conf
JobAcctGatherType=jobacct_gather/cgroup
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
# cat /etc/slurm/cgroup.conf
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
AllowedRAMSpace=100
AllowedSwapSpace=0
ConstrainCores=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxRAMPercent=100
MaxSwapPercent=100
MaxKmemPercent=100
MemorySwappiness=0
MinKmemSpace=30
MinRAMSpace=30
TaskAffinity=no

groucho64738 commented 3 years ago

You're correct, I don't have a cpuacct cgroup entry on my systems. I do have slightly different settings in my config. In slurm.conf I have JobAcctGatherType set, but it is set to jobacct_gather/linux rather than jobacct_gather/cgroup, which sounds like it might be the cause. I don't see anything in cgroup.conf that looks like it would create that particular item. The man page for slurm.conf doesn't fill me with confidence that I should change that setting while I have jobs running, so it might be a bit before I turn it on to see if it makes a difference.

groucho64738 commented 3 years ago

FYI, changing from jobacct_gather/linux to jobacct_gather/cgroup did create the cpuacct directory. Let me set that up across the board and see how it goes. The SLURM documentation indicates that it's a much slower accounting mechanism than the 'linux' one, but I don't know whether that will really affect our cluster. We'll have to see if jobs get stuck in cleanup for too long.

groucho64738 commented 3 years ago

It works. I launched cgroup_exporter --config.paths=/slurm, then launched a job and saw the metrics being collected.

treydock commented 3 years ago

Just FYI: if you use cgroups at all, you almost always want jobacct_gather/cgroup; I believe it's more accurate when doing accounting for cgroup jobs. We run that setting on a ~650-node cluster and a ~825-node cluster and have not noticed any performance issues, even while supporting ~20-30K jobs a day per cluster. The only time our jobs get stuck during cleanup is when a job is stuck in I/O wait against our GPFS filesystem and can't be cleaned up because its processes are in D state. We've worked around this by setting UnkillableStepTimeout=180, because unkillable jobs end up marking a node as drained; we still see some unkillable jobs, but they are usually related to GPFS issues.