treydock / cgroup_exporter

Apache License 2.0

Slurm cgroup_v2 "/sys/fs/cgroup/system.slice/nodename_slurmstepd.scope" #30

Closed: pddro-Vestas closed this issue 6 months ago

pddro-Vestas commented 6 months ago

@treydock I was testing version 1.0.0-rc.0 on Ubuntu 22.04 with Slurm and cgroup v2, and I am getting the error below:

caller=cgroupv2.go:221 level=error msg="Error loading cgroup processes" path=/slurm group=/system.slice/slurmstepd.scope err="lstat /sys/fs/cgroup/system.slice/slurmstepd.scope: no such file or directory"

I have done some investigation and it looks like Slurm uses two formats of cgroup directories (See https://slurm.schedmd.com/cgroup_v2.html#slurmd_startup for more details):

"/sys/fs/cgroup/system.slice/slurmstepd.scope"
"/sys/fs/cgroup/system.slice/nodename_slurmstepd.scope"

In my case, Slurm creates the directory using the format "/sys/fs/cgroup/system.slice/nodename_slurmstepd.scope", which is not currently supported.

Could you support this format? Thanks in advance for your support.
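
For reference, here is a minimal Go sketch of how both directory formats could be recognized under system.slice. It is only an illustration of the two naming patterns listed above, not code from cgroup_exporter; the function names `isStepdScope` and `findStepdScopes` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isStepdScope reports whether a directory name matches either format:
// "slurmstepd.scope" or "<nodename>_slurmstepd.scope".
// Hypothetical helper, not part of cgroup_exporter.
func isStepdScope(name string) bool {
	const base = "slurmstepd.scope"
	if name == base {
		return true
	}
	// Require a non-empty node name in front of the "_" separator.
	return strings.HasSuffix(name, "_"+base) && len(name) > len("_"+base)
}

// findStepdScopes lists matching directories under <root>/system.slice
// and returns them relative to the cgroup root.
func findStepdScopes(root string) ([]string, error) {
	entries, err := os.ReadDir(filepath.Join(root, "system.slice"))
	if err != nil {
		return nil, err
	}
	var found []string
	for _, e := range entries {
		if e.IsDir() && isStepdScope(e.Name()) {
			found = append(found, "/system.slice/"+e.Name())
		}
	}
	return found, nil
}

func main() {
	// /sys/fs/cgroup is the cgroup v2 mount point seen in the error above.
	scopes, err := findStepdScopes("/sys/fs/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	for _, s := range scopes {
		fmt.Println(s)
	}
}
```

Run on a compute node, this prints whichever slurmstepd scope directory exists there, which makes it easy to see which of the two formats a given Slurm build uses.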

treydock commented 6 months ago

Does the nodename path contain the literal string "nodename", or is nodename replaced with the slurmd node name? Since this appears to happen when Slurm is compiled to allow multiple slurmd, it is likely replaced with that slurmd's nodename, but it would be good to confirm.

Could you share this output?

ls -la /sys/fs/cgroup/system.slice/ | grep slurm

Also, do you compile Slurm with --enable-multiple-slurmd, and do you need that in what I presume is a production system? My understanding is that --enable-multiple-slurmd is a Slurm feature intended for development.

pddro-Vestas commented 6 months ago

The nodename is replaced with the actual node name. See below output from ls (I have redacted the 1st part of the nodename).

ls -la /sys/fs/cgroup/system.slice/ | grep slurm
drwxr-xr-x   2 root root 0 May 16 06:28 slurmd.service
drwxr-xr-x 123 root root 0 May 16 06:28 [redacted]-hpc-1_slurmstepd.scope

We are using precompiled binaries from the project https://github.com/Azure/cyclecloud-slurm.

I have done a very naïve implementation of this cgroup format here: https://github.com/pddro-Vestas/cgroup_exporter/blob/a9ffb3a5f97e04bd3a2260efa4701f139346980c/collector/cgroupv2.go#L209C5-L209C11

I have been testing it for the last few days and it is working correctly.

Thanks for your help.

treydock commented 6 months ago

I don't think hostname will work because NodeName doesn't have to equal hostname in Slurm. I have several systems where the hostname may be quite different from the NodeName.

I think what's needed here is to remove this assumption:

https://github.com/treydock/cgroup_exporter/blob/8a7e6603471178661c3b78f9533cc59cf2b88238/collector/cgroupv2.go#L203-L207

I kept it so that --config.paths=/slurm would work for both Cgroupv1 and Cgroupv2 even though the path for Cgroupv2 is different. I think the simplest approach would be to allow you to do this:

--config.paths=/[nodename]_slurmstepd.scope

Then the exporter could discover the Slurm cgroupv2 path. I'll try to come up with a patch that works for your case, but it will likely require you to pass the --config.paths flag to the exporter with something specific to the cgroups you have. I'll mention this in the README so hopefully it's clear that simply doing --config.paths=/slurm won't work in all cases.
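
To make the assumption concrete, here is a rough Go sketch of the mapping seen in the error log above (path=/slurm, group=/system.slice/slurmstepd.scope), with any other value passed through unchanged. This is a hypothetical illustration, not the code at the linked lines and not necessarily what a patch would look like; `resolveGroup` is an invented name.

```go
package main

import "fmt"

// resolveGroup maps a --config.paths value to a cgroup v2 group.
// Hypothetical illustration only, not the exporter's actual code.
func resolveGroup(configPath string) string {
	// Shortcut so the cgroup v1 style value "/slurm" still resolves on v2,
	// matching the path/group pair shown in the reported error.
	if configPath == "/slurm" {
		return "/system.slice/slurmstepd.scope"
	}
	// Anything else is treated as an explicit group and used as given.
	return configPath
}

func main() {
	// "node1" is a placeholder node name.
	for _, p := range []string{"/slurm", "/system.slice/node1_slurmstepd.scope"} {
		fmt.Printf("%s -> /sys/fs/cgroup%s\n", p, resolveGroup(p))
	}
}
```

The point of the sketch is that the shortcut only covers one fixed directory name, so a site-specific scope name has to be passed explicitly.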

pddro-Vestas commented 6 months ago

Implementing it as a parameter is a great idea. Your suggestion --config.paths=/[nodename]_slurmstepd.scope will work well in our system, because I do the configuration on node boot up, and I can get the hostname easily.

Just let me know when done, I will be happy to test the new version.

Again, thanks a lot for your help.

treydock commented 6 months ago

A fix is released with 1.0.0-rc.1. Let me know if using --config.paths=/system.slice/<nodename>_slurmstepd.scope works for you and I'll do a 1.0.0 release.
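
If the flag is generated at node boot, a small Go helper along these lines could build the value from the Slurm NodeName. This is only a sketch: `stepdConfigPath` is a hypothetical name, and the node name is taken as an argument on purpose, since NodeName does not have to match the hostname.

```go
package main

import (
	"fmt"
	"os"
)

// stepdConfigPath builds the --config.paths value for a given Slurm
// NodeName, following the <nodename>_slurmstepd.scope format.
// Hypothetical helper, not part of cgroup_exporter.
func stepdConfigPath(nodeName string) string {
	return "/system.slice/" + nodeName + "_slurmstepd.scope"
}

func main() {
	// The NodeName is passed explicitly rather than read from the hostname,
	// because the two can differ on some systems.
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: stepdpath <slurm-nodename>")
		os.Exit(2)
	}
	fmt.Println("--config.paths=" + stepdConfigPath(os.Args[1]))
}
```

On systems where the hostname and NodeName are known to be identical, the hostname can be passed as the argument.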

pddro-Vestas commented 5 months ago

@treydock thanks for the new release. I have deployed it to one of our test clusters and will let you know if I see any issues.

pddro-Vestas commented 5 months ago

@treydock 1.0.0-rc.1 is working perfectly in our test env. Thanks.