yashhirulkar701 opened 4 months ago
I ran into the same issue; it seems that the provided image ghcr.io/stackhpc/slurm-docker-cluster
does not include a build of the cgroup v2 plugin:
sh-4.4# ls /usr/lib64/slurm/cgroup*
/usr/lib64/slurm/cgroup_v1.a /usr/lib64/slurm/cgroup_v1.la /usr/lib64/slurm/cgroup_v1.so
Only cgroup_v1 is available in this image. I will check how to build an image with cgroup_v2 or fall back to cgroup_v1.
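For reference, the cgroup/v2 plugin is only built when Slurm's configure step finds the dbus development headers, so rebuilding the image with that dependency present should produce cgroup_v2.so. A rough sketch of the rebuild step (the source path and configure flags are assumptions, not taken from the stackhpc Dockerfile):

dnf install -y dbus-devel              # cgroup/v2 needs the dbus headers at configure time
cd /path/to/slurm-source               # assumption: wherever the image's build unpacks the Slurm source
./configure --prefix=/usr --libdir=/usr/lib64 --sysconfdir=/etc/slurm
make -j"$(nproc)" && make install
ls /usr/lib64/slurm/cgroup_v2.so       # should now exist alongside cgroup_v1.so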
After adding the cgroup_v2.so plugin I am getting:
slurmd: error: cgroup_dbus_attach_to_scope: cannot connect to dbus system daemon: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
slurmd: error: _init_new_scope_dbus: scope and/or cgroup directory for slurmstepd could not be set.
slurmd: error: cannot initialize cgroup directory for stepds: if the scope /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-poda8507382_5d59_4c0a_8119_c6651b541881.slice/system.slice/slurmstepd.scope already exists it means the associated cgroup directories disappeared and the scope entered in a failed state. You should investigate why the scope lost its cgroup directories and possibly use the 'systemd reset-failed' command to fix this inconsistent systemd state.
slurmd: error: Couldn't load specified plugin name for cgroup/v2: Plugin init() callback failed
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
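The dbus error above occurs because there is no systemd/dbus running inside the container for the cgroup/v2 plugin to register its slurmstepd scope with. If your Slurm version is recent enough (23.02+, per the cgroup.conf man page), one possible workaround is to tell the plugin to skip dbus entirely; a sketch of the relevant cgroup.conf lines, untested in this image:

# /etc/slurm/cgroup.conf
CgroupPlugin=cgroup/v2
IgnoreSystemd=yes        # do not contact dbus/systemd to create the slurmstepd scope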
It is probably better to disable cgroups in the configuration.
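For completeness, "disable cgroups in the configuration" could look roughly like this in slurm.conf (a sketch only; choose non-cgroup plugins wherever cgroup-based ones are referenced, and remove any cgroup constraint settings from cgroup.conf):

# /etc/slurm/slurm.conf (relevant lines only)
ProctrackType=proctrack/linuxproc
TaskPlugin=task/none
JobAcctGatherType=jobacct_gather/linux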
Hi, I faced the same issue when setting up a k3s cluster on GCP. The problem stems from the cgroup version on the VM, which was running Debian. Here's how I solved it:
Issue: cgroup version mismatch (the image supports only cgroup/v1, but the VM uses cgroup v2)
First, check the cgroup version:
$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
The VM defaulted to cgroup v2, so I had to switch it back to cgroup v1. To do this, edit the GRUB configuration:
$ sudo nano /etc/default/grub
Append systemd.unified_cgroup_hierarchy=0 to the existing kernel command line so the system boots with cgroup v1:
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=0"
Update the GRUB configuration:
$ sudo update-grub
$ sudo reboot
After reboot, verify the cgroup version again:
$ mount | grep cgroup
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
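Another quick way to confirm which hierarchy is active is to check the filesystem type mounted at /sys/fs/cgroup (tmpfs for cgroup v1, cgroup2fs for cgroup v2):

$ stat -fc %T /sys/fs/cgroup
tmpfs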
Install the NFS client on the Slurm nodes:
sudo apt-get update
sudo apt-get install -y nfs-common
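To confirm the NFS client works before redeploying, you can query the exports of your NFS server (the hostname below is a placeholder for whatever server your cluster uses):

$ showmount -e <nfs-server>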
Once these steps were complete, I was able to redeploy from the repository, and the Slurm daemon pods started running successfully. I hope this helps you resolve the issue!