Closed pddro-Vestas closed 6 months ago
Does the nodename
path contain the literal string "nodename", or is nodename
replaced with the slurmd node name? Since this appears to happen when Slurm is compiled with support for multiple slurmd daemons, it is likely replaced with that slurmd's node name, but it would be good to confirm.
Could you share this output?
ls -la /sys/fs/cgroup/system.slice/ | grep slurm
Also, do you compile Slurm with --enable-multiple-slurmd,
and do you need that in what I presume is a production system? My understanding is that --enable-multiple-slurmd
is a feature of Slurm intended for development.
The nodename is replaced with the actual node name. See the ls output below (I have redacted the first part of the node name).
ls -la /sys/fs/cgroup/system.slice/ | grep slurm
drwxr-xr-x 2 root root 0 May 16 06:28 slurmd.service
drwxr-xr-x 123 root root 0 May 16 06:28 [redacted]-hpc-1_slurmstepd.scope
We are using precompiled binaries from the project https://github.com/Azure/cyclecloud-slurm.
I have done a very naïve implementation of this cgroup format here: https://github.com/pddro-Vestas/cgroup_exporter/blob/a9ffb3a5f97e04bd3a2260efa4701f139346980c/collector/cgroupv2.go#L209C5-L209C11
I have been testing it for the last few days and it is working correctly.
Thanks for your help.
I don't think hostname will work because NodeName doesn't have to equal the hostname in Slurm. I have several systems where the hostname may be quite different from the NodeName.
I think what's needed here is to remove this assumption:
I kept it so that --config.paths=/slurm
would work for both cgroup v1 and cgroup v2, even though the path for cgroup v2 is different. I think the simplest approach would be to allow you to do this:
--config.paths=/[nodename]_slurmstepd.scope
Then the exporter could discover the Slurm cgroup v2 path. I'll try to come up with a patch that works for your case, but it will likely require you to pass the --config.paths
flag to the exporter with something specific to the cgroups you have. I'll mention this in the README so hopefully it's clear that simply doing --config.paths=/slurm
won't work in all cases.
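The placeholder scheme proposed above could be sketched roughly as follows. This is only a minimal sketch of the idea: the [nodename] placeholder, the function names, and the glob-based matching are my assumptions, not the actual patch that was released.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// placeholderToGlob turns a configured path containing the hypothetical
// [nodename] placeholder into a filesystem glob pattern.
func placeholderToGlob(configured string) string {
	return strings.ReplaceAll(configured, "[nodename]", "*")
}

// expandPath resolves a configured --config.paths value against the
// cgroup v2 unified hierarchy by globbing under system.slice, so the
// exporter can discover e.g. "<nodename>_slurmstepd.scope" without
// knowing the node name in advance.
func expandPath(cgroupRoot, configured string) (string, error) {
	pattern := filepath.Join(cgroupRoot, "system.slice", placeholderToGlob(configured))
	matches, err := filepath.Glob(pattern)
	if err != nil {
		return "", err
	}
	if len(matches) == 0 {
		return "", fmt.Errorf("no cgroup directory matches %s", pattern)
	}
	return matches[0], nil
}

func main() {
	path, err := expandPath("/sys/fs/cgroup", "/[nodename]_slurmstepd.scope")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("discovered Slurm cgroup:", path)
}
```

Globbing keeps the flag static across nodes while still matching the per-node scope directory name.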
Implementing it as a parameter is a great idea. Your suggestion --config.paths=/[nodename]_slurmstepd.scope
will work well on our system, because I do the configuration at node boot-up and can get the hostname easily.
Just let me know when it's done; I will be happy to test the new version.
Again, thanks a lot for your help.
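The boot-time configuration mentioned above could be sketched like this. It assumes the Slurm NodeName equals the short hostname, which holds on this cluster but, as noted earlier in the thread, is not guaranteed in general; the function name is mine.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// buildPathsFlag renders the cgroup_exporter --config.paths flag for a
// given hostname, stripping any DNS domain suffix. It assumes the Slurm
// NodeName equals the short hostname (true here, not true everywhere).
func buildPathsFlag(host string) string {
	short := strings.SplitN(host, ".", 2)[0]
	return fmt.Sprintf("--config.paths=/system.slice/%s_slurmstepd.scope", short)
}

func main() {
	host, err := os.Hostname()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Emit the flag so a boot script can splice it into the exporter's
	// service unit or wrapper script.
	fmt.Println(buildPathsFlag(host))
}
```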
A fix is released with 1.0.0-rc.1. Let me know if using --config.paths=/system.slice/<nodename>_slurmstepd.scope
works for you and I'll do a 1.0.0 release.
@treydock thanks for the new release. I have deployed in one of our test clusters. Will let you know if I see any issues.
@treydock 1.0.0-rc.1 is working perfectly in our test env. Thanks.
@treydock I was testing version 1.0.0-rc.0 on Ubuntu 22.04 with Slurm and cgroup v2, and I am getting the error below:
caller=cgroupv2.go:221 level=error msg="Error loading cgroup processes" path=/slurm group=/system.slice/slurmstepd.scope err="lstat /sys/fs/cgroup/system.slice/slurmstepd.scope: no such file or directory"
I have done some investigation, and it looks like Slurm uses two formats for its cgroup directories (see https://slurm.schedmd.com/cgroup_v2.html#slurmd_startup for more details):
In my situation, Slurm is creating cgroups using the format "/sys/fs/cgroup/system.slice/nodename_slurmstepd.scope", which is not currently supported.
Could you support this format? Thanks in advance for your support.
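The two directory layouts described above can be probed with a small check; this mirrors the failing lstat in the error log. It is only an illustration, and the function name is mine: first try the plain scope path a default slurmd creates, then fall back to matching a node-name-prefixed scope.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// probeSlurmLayout reports which Slurm cgroup v2 layout exists under
// root: "system.slice/slurmstepd.scope" (default slurmd) or
// "system.slice/<nodename>_slurmstepd.scope" (node-name-prefixed).
func probeSlurmLayout(root string) (string, error) {
	plain := filepath.Join(root, "system.slice", "slurmstepd.scope")
	if _, err := os.Lstat(plain); err == nil {
		return plain, nil
	}
	// Fall back to the prefixed form; the node name is unknown, so glob.
	matches, err := filepath.Glob(filepath.Join(root, "system.slice", "*_slurmstepd.scope"))
	if err != nil {
		return "", err
	}
	if len(matches) > 0 {
		return matches[0], nil
	}
	return "", fmt.Errorf("no slurmstepd scope found under %s", root)
}

func main() {
	if p, err := probeSlurmLayout("/sys/fs/cgroup"); err == nil {
		fmt.Println("slurmstepd scope:", p)
	} else {
		fmt.Println(err)
	}
}
```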