slurm_ubuntu_gpu_cluster

Guide on how to set up a GPU cluster on Ubuntu 22.04 using Slurm (with cgroups).

Acknowledgements

Thanks to nateGeorge for the guide he wrote. I would highly recommend checking it out first as it is very descriptive.

Assumptions:

Additionally, after step one of the guide (Set hostname on the server), add the node's <name> after <IP> <FQDN> in /etc/hosts, so that the line looks like 111.xx.111.xx masternode.master.local masternode.
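For example, with the placeholder addresses used in this guide, the /etc/hosts entries on each machine might look like the following (the worker hostname and FQDN are only illustrations):

111.xx.111.xx masternode.master.local masternode
222.xx.222.xx workernode.worker.local workernode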

Also, on worker_node do the following:

Set up NFS

Master node:

sudo apt-get update
sudo apt-get install nfs-kernel-server
sudo mkdir /storage -p
sudo chown master_node:master_node /storage/
sudo vim /etc/exports
/storage 222.xx.222.xx(rw,sync,no_root_squash,no_subtree_check)
sudo systemctl restart nfs-kernel-server
sudo ufw allow from 222.xx.222.xx to any port nfs
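Restarting nfs-kernel-server re-reads /etc/exports; if you want to double-check the export without restarting, you can also run:

sudo exportfs -ra    # re-export everything in /etc/exports
sudo exportfs -v     # verify that /storage is exported to 222.xx.222.xx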

Worker node:

sudo apt-get update
sudo apt-get install nfs-common
sudo mkdir -p /storage
sudo mount 111.xx.111.xx:/storage /storage
echo 111.xx.111.xx:/storage /storage nfs auto,timeo=14,intr 0 0 | sudo tee -a /etc/fstab
sudo chown worker_node:worker_node /storage/
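A quick check that the share is mounted:

df -h /storage    # the filesystem column should show 111.xx.111.xx:/storage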

Set up MUNGE

Master node:

sudo apt-get install libmunge-dev libmunge2 munge -y
sudo systemctl enable munge
sudo systemctl start munge
munge -n | unmunge | grep STATUS
sudo cp /etc/munge/munge.key /storage/
sudo chown munge /storage/munge.key
sudo chmod 400 /storage/munge.key

Worker node:

sudo apt-get install libmunge-dev libmunge2 munge
sudo cp /storage/munge.key /etc/munge/munge.key
sudo systemctl enable munge
sudo systemctl start munge
munge -n | unmunge | grep STATUS
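Optionally, to confirm both nodes share the same key, encode a credential on worker_node and decode it on the master (this assumes SSH access from worker_node to masternode):

munge -n | ssh masternode unmunge    # STATUS should report Success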

Set up DB for Slurm

Clone this repo with config and service files:

cd /storage
git clone https://github.com/lopentusska/slurm_ubuntu_gpu_cluster

Install prerequisites for DB:

sudo apt-get install python3 gcc make openssl ruby ruby-dev libpam0g-dev libmariadb-dev mariadb-server build-essential libssl-dev numactl hwloc libmunge-dev man2html lua5.3 -y
sudo gem install fpm
sudo systemctl enable mysql
sudo systemctl start mysql
sudo mysql -u root
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit

Copy db config: cp /storage/slurm_ubuntu_gpu_cluster/configs_services/slurmdbd.conf /storage
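For reference, the slurmdbd.conf in the repo should contain settings along these lines (an illustrative excerpt only; the values shown are assumptions chosen to match the database commands above):

AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid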

Set up Slurm:

Download and install Slurm on master node

Build installation file

You should check the Slurm download page and install the latest version.

cd /storage
wget https://download.schedmd.com/slurm/slurm-23.11.4.tar.bz2
tar xvjf slurm-23.11.4.tar.bz2
cd slurm-23.11.4/
./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm
make
make contrib
make install
cd ..

Install Slurm

sudo fpm -s dir -t deb -v 1.0 -n slurm-23.11.4 --prefix=/usr -C /tmp/slurm-build/ .
sudo dpkg -i slurm-23.11.4_1.0_amd64.deb
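Optionally, confirm the package installed correctly:

slurmd -V       # should print the version you built, e.g. slurm 23.11.4
slurmctld -V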

Make directories:

sudo mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
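Note: the chown above assumes a slurm system user exists. The deb built with fpm most likely does not create one, so if id slurm reports that the user does not exist, create it first (a minimal sketch):

sudo adduser --system --group --no-create-home slurm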

Copy slurm services:

sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/slurmdbd.service /etc/systemd/system/
sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/slurmctld.service /etc/systemd/system/

Copy slurm DB config:

sudo cp /storage/slurmdbd.conf /etc/slurm/
sudo chmod 600 /etc/slurm/slurmdbd.conf
sudo chown slurm /etc/slurm/slurmdbd.conf

Open ports for slurm communication:

sudo ufw allow from any to any port 6817
sudo ufw allow from any to any port 6818

Start slurm services:

sudo systemctl daemon-reload
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
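Check that both services came up cleanly:

sudo systemctl status slurmdbd
sudo systemctl status slurmctld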

If master is a compute (worker) node:

sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd

Install Slurm on worker node:

cd /storage
sudo dpkg -i slurm-23.11.4_1.0_amd64.deb
sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/slurmd.service /etc/systemd/system

Open ports for slurm communication:

sudo ufw allow from any to any port 6817
sudo ufw allow from any to any port 6818
sudo systemctl enable slurmd
sudo systemctl start slurmd

Configure Slurm

In /storage/slurm_ubuntu_gpu_cluster/configs_services/slurm.conf change:

ControlMachine=masternode.master.local - use your FQDN

ControlAddr=111.xx.111.xx - use IP of your masternode

Use sudo slurmd -C to print out the machine specs. Copy the specs of all machines into the slurm.conf file and adjust them as needed.
An example of how it should look in your config file:

NodeName=masternode NodeAddr=111.xx.111.xx Gres=gpu:1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=63502
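The slurm.conf in the repo should also define a partition (the test script at the end of this guide submits to one). If you need to add or adjust it, a minimal hypothetical example (partition name and node list are assumptions):

PartitionName=debug Nodes=masternode,workernode Default=YES MaxTime=INFINITE State=UP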

After you are done editing slurm.conf:

sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/slurm.conf /storage/

Edit the /storage/slurm_ubuntu_gpu_cluster/configs_services/gres.conf file.

NodeName=masternode Name=gpu File=/dev/nvidia0
NodeName=workernode Name=gpu File=/dev/nvidia0

You can use nvidia-smi to find out the number you should use instead of 0 in nvidia0. You will find it to the left of the GPU name.
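If a node has more than one GPU, gres.conf can list the devices with a range; a hypothetical two-GPU example (the matching slurm.conf entry would then use Gres=gpu:2):

NodeName=workernode Name=gpu File=/dev/nvidia[0-1]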

Copy the .conf files (except slurmdbd.conf) to all machines. On worker_node, create the slurm directory first: sudo mkdir /etc/slurm/

sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/cgroup* /etc/slurm/
sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/slurm.conf /etc/slurm/
sudo cp /storage/slurm_ubuntu_gpu_cluster/configs_services/gres.conf /etc/slurm/
sudo mkdir -p /var/spool/slurm/d
sudo chown slurm /var/spool/slurm/d

Configure cgroups

sudo vim /etc/default/grub

add:

GRUB_CMDLINE_LINUX_DEFAULT="cgroup_enable=memory systemd.unified_cgroup_hierarchy=0"

then:

sudo update-grub
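After the reboot in the next section, you can confirm the kernel parameters took effect:

cat /proc/cmdline    # should include cgroup_enable=memory systemd.unified_cgroup_hierarchy=0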

Start Slurm

Reboot the machines, then on master_node:

sudo systemctl restart slurmctld
sudo systemctl restart slurmdbd
sudo systemctl restart slurmd

on worker_node:

sudo systemctl restart slurmd

Create cluster: sudo sacctmgr add cluster compute-cluster
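As a quick sanity check, sinfo should now list your nodes, typically in the idle state once each slurmd has registered with the controller:

sinfo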

Finally:

sudo apt update
sudo apt upgrade
sudo apt autoremove

Logs

If something doesn't work, you can find logs for slurmctld, slurmdbd and slurmd in /var/log/slurm/.

Script

I've also added a simple script to check that Slurm works. It runs srun hostname, which simply prints the name of the node on which the job was started.

You will need to move the file into /storage.

Inside the script, change:

partition,

nodelist (choose which node to run on).

Then you can run the script with:

sbatch script_slurm_hostname.sh
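For reference, such a script is roughly of this shape (a minimal sketch; the actual file ships in the repo, and the partition and node names below are assumptions):

#!/bin/bash
#SBATCH --job-name=hostname_test
#SBATCH --partition=debug
#SBATCH --nodelist=workernode
#SBATCH --output=hostname_test.out

srun hostname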