stackhpc / slurm-k8s-cluster

A Slurm cluster for Kubernetes
MIT License

Error: cannot find cgroup plugin for cgroup/v2, slurmd initialization failed #39

Open · yashhirulkar701 opened this issue 4 months ago

yashhirulkar701 commented 4 months ago
I am trying to create a Slurm cluster on Kubernetes (Azure Kubernetes Service), but the slurmd pod keeps crashing with the error "Couldn't find the specified plugin name for cgroup/v2 looking at all files". The errors for the slurmctld and slurmd pods are included below.

I have tried to debug this a lot, but with no luck. Any idea how to fix this on a Kubernetes cluster?

I can also see that slurmdbd is unable to connect to slurmctld, as shown below.

> k logs -f slurmdbd-6f59cc7887-4mwwq 

slurmdbd: debug2: _slurm_connect: failed to connect to 10.244.3.117:6817: Connection refused
slurmdbd: debug2: Error connecting slurm stream socket at 10.244.3.117:6817: Connection refused
slurmdbd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:10.244.3.117:6817: Connection refused
slurmdbd: error: slurmdb_send_accounting_update_persist: Unable to open connection to registered cluster linux.
slurmdbd: error: slurm_receive_msg: No response to persist_init
slurmdbd: error: update cluster: Connection refused to linux at 10.244.3.117(6817)

> k logs -f slurmctld-0

slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state`, No such file or directory
slurmctld: error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state.old`, No such file or directory
slurmctld: No job state file (/var/spool/slurmctld/job_state.old) found
slurmctld: debug2: accounting_storage/slurmdbd: _send_cluster_tres: Sending tres '1=40,2=10,3=0,4=10,5=40,6=0,7=0,8=0' for cluster
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.5:7132]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-1 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-1/slurmd-1/slurmd-1 assigned to node slurmd-1
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-1 
slurmctld: debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.6:38712]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-0 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-0/slurmd-0/slurmd-0 assigned to node slurmd-0
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-0 
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints

> k logs -f pod/slurmd-0           
---> Set shell resource limits ...
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3547560
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
---> Copying MUNGE key ...
---> Starting the MUNGE Authentication service (munged) ...
---> Waiting for slurmctld to become active before starting slurmd...
-- slurmctld is now active ...
---> Starting the Slurm Node Daemon (slurmd) ...
slurmd: CPUs=96 Boards=1 Sockets=2 Cores=48 Threads=1 Memory=886898 TmpDisk=0 Uptime=37960 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:96 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:1
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
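
For reference, a quick way to check which cgroup version the node (and therefore the slurmd container) is running is to look at the filesystem type mounted on /sys/fs/cgroup; this is a generic check, not something specific to this chart:

    # prints "cgroup2fs" on a cgroup v2 host, "tmpfs" on cgroup v1
    stat -fc %T /sys/fs/cgroup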
akram commented 2 months ago

I went through the same issue; it seems that the provided image ghcr.io/stackhpc/slurm-docker-cluster does not contain a build of the cgroup/v2 plugin:

sh-4.4# ls /usr/lib64/slurm/cgroup*
/usr/lib64/slurm/cgroup_v1.a  /usr/lib64/slurm/cgroup_v1.la  /usr/lib64/slurm/cgroup_v1.so

Only cgroup_v1 is shipped in this image. I will check how to build an image with cgroup_v2, or fall back to cgroup_v1.
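
A minimal sketch of what a rebuild with the cgroup/v2 plugin could look like, assuming a RHEL-like base image and a build from the SchedMD release tarball; the plugin is only produced when the dbus-1 development headers are found at configure time, and the version and paths below are illustrative rather than what the upstream Dockerfile uses:

    # headers the cgroup/v2 plugin needs at configure time (RHEL-like package names)
    dnf install -y gcc make bzip2 dbus-devel munge-devel
    # build Slurm from source; 23.02.7 is just an example release
    curl -LO https://download.schedmd.com/slurm/slurm-23.02.7.tar.bz2
    tar xjf slurm-23.02.7.tar.bz2 && cd slurm-23.02.7
    ./configure --prefix=/usr --sysconfdir=/etc/slurm --libdir=/usr/lib64
    make -j"$(nproc)" && make install
    # cgroup_v2.so should now sit next to cgroup_v1.so
    ls /usr/lib64/slurm/cgroup_v2.so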

akram commented 2 months ago

After adding the cgroup_v2.so plugin, I am getting:

slurmd: error: cgroup_dbus_attach_to_scope: cannot connect to dbus system daemon: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
slurmd: error: _init_new_scope_dbus: scope and/or cgroup directory for slurmstepd could not be set.
slurmd: error: cannot initialize cgroup directory for stepds: if the scope /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-poda8507382_5d59_4c0a_8119_c6651b541881.slice/system.slice/slurmstepd.scope already exists it means the associated cgroup directories disappeared and the scope entered in a failed state. You should investigate why the scope lost its cgroup directories and possibly use the 'systemd reset-failed' command to fix this inconsistent systemd state.
slurmd: error: Couldn't load specified plugin name for cgroup/v2: Plugin init() callback failed
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed

It is probably better to disable cgroups via configuration.
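
As a sketch, "disabling cgroups by configuration" could mean swapping the cgroup-based plugins for their non-cgroup equivalents in slurm.conf; these directives are standard Slurm options, but where this chart mounts slurm.conf, and whether slurmd in this image still probes the cgroup plugin at startup, are assumptions to verify:

    # in slurm.conf: avoid the cgroup-based plugins
    ProctrackType=proctrack/linuxproc
    TaskPlugin=task/affinity
    JobAcctGatherType=jobacct_gather/linux
    # and drop any ConstrainCores/ConstrainRAMSpace/ConstrainDevices lines from cgroup.conf

This trades away cgroup-based resource enforcement, so jobs are tracked but not hard-limited.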

pong1013 commented 1 month ago

Hi, I faced the same issue when setting up a k3s cluster on GCP. The problem stems from the cgroup version on the VM, which in my case was running Debian. Here’s how I solved it:

  1. Issue: the VM's cgroup version
    First, check the cgroup version:

    $ mount | grep cgroup
    cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
  2. The VM defaulted to cgroup v2, so I had to switch it back to cgroup v1. To do this, edit the GRUB configuration:

    $ sudo nano /etc/default/grub

    Append systemd.unified_cgroup_hierarchy=0 to the GRUB_CMDLINE_LINUX line so the system uses cgroup v1:

    GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=0"

    Update the GRUB configuration:

    $ sudo update-grub
    $ sudo reboot
  3. After reboot, verify the cgroup version again:

    $ mount | grep cgroup
    cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
    cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
  4. Install the NFS client (nfs-common) on the Slurm client nodes:

    sudo apt-get update
    sudo apt-get install -y nfs-common

Once these steps were complete, I was able to redeploy this repository's cluster, and the Slurm daemon pods started running successfully. This should help you resolve the issue!
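
As a quick sanity check after redeploying (the pod names below match the defaults seen earlier in this thread):

    # slurmd pods should stay Running rather than crash-looping
    kubectl get pods
    # from the controller, the compute nodes should register and show up in sinfo
    kubectl exec -it slurmctld-0 -- sinfo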