huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

Feature request - SLURM support #1239

yuvalkirstain commented 1 year ago

Hi, it would be really great if you could add SLURM support, or at least add a doc that shows how to run accelerate with multiple nodes on SLURM. I have seen several issues from people who want to use accelerate with SLURM. Thank you!

yuvalkirstain commented 1 year ago

Here is my solution - hope this can help accelerate support SLURM :)

It requires submitit (I also use hydra, but you can switch to argparse or fire) and makes everything super easy to use. You simply add the command for your script and you are pretty much done :)

import os
import random
import sys
import hydra
import submitit
from omegaconf import DictConfig
from trainer.accelerators.utils import nvidia_smi_gpu_memory_stats

def print_env():
    for key in sorted(os.environ.keys()):
        if not (
                key.startswith(("SLURM_", "SUBMITIT_"))
                or key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK", "LOCAL_WORLD_SIZE")
        ):
            continue
        value = os.environ[key]
        print(f"{key}={value}")

class Task:

    def __init__(self, cfg: DictConfig):
        self.cfg = cfg

    def __call__(self):
        print("Running task on slurm")
        print("exporting PyTorch distributed environment variables")
        dist_env = submitit.helpers.TorchDistributedEnvironment()
        rng = random.Random(dist_env._job_env.job_id)
        dist_env.master_port = rng.randint(10000, 20000)
        dist_env = dist_env.export()
        os.environ.update(**{
            "CUDA_LAUNCH_BLOCKING": "1",
            "NCCL_DEBUG": "info",
            "CUDA_VISIBLE_DEVICES": os.environ["SLURM_JOB_GPUS"],
        })
        print(nvidia_smi_gpu_memory_stats())
        print(f"master: {dist_env.master_addr}:{dist_env.master_port}")
        print(f"rank: {dist_env.rank}")
        print(f"world size: {dist_env.world_size}")
        print(f"local rank: {dist_env.local_rank}")
        print(f"local world size: {dist_env.local_world_size}")
        print("Running training script")
        print(f"Local rank {dist_env.local_rank}: {os.environ['CUDA_VISIBLE_DEVICES']=}")
        num_processes = self.cfg.slurm.n_processes * self.cfg.slurm.n_nodes
        machine_rank = dist_env.rank // self.cfg.slurm.n_processes
        cmd = f"accelerate launch --dynamo_backend no --num_processes {num_processes} --num_machines {self.cfg.slurm.n_nodes} --use_deepspeed --machine_rank {machine_rank} --main_process_ip {dist_env.master_addr} --main_process_port {dist_env.master_port} trainer/scripts/train.py {self.cfg.slurm.cmd}"
        print(f"Running command: {cmd}")
        print_env()
        if dist_env.local_rank == 0:
            os.system(cmd)
        else:
            print("Waiting for master to finish")

    def checkpoint(self):
        print("checkpointing")
        return submitit.helpers.DelayedSubmission(self)

@hydra.main(version_base=None, config_path="../conf", config_name="slurm_config")
def main(cfg: DictConfig) -> None:
    # import pydevd_pycharm
    # pydevd_pycharm.settrace('localhost', port=5900, stdoutToServer=True, stderrToServer=True)
    executor = submitit.AutoExecutor(folder="logs")
    print(cfg)
    slurm_kwargs = {
        "slurm_job_name": cfg.slurm.job_name,
        "slurm_partition": cfg.slurm.partition,
        "slurm_nodes": cfg.slurm.n_nodes,
        "slurm_additional_parameters": {
            "gpus": cfg.slurm.n_processes,
            "ntasks_per_node": cfg.slurm.n_processes,
        },
        "slurm_cpus_per_task": 12,
        "slurm_time": cfg.slurm.time_limit,
        "slurm_exclude": cfg.slurm.exclude if cfg.slurm.exclude else "",
        "stderr_to_stdout": True,
        "slurm_mem": "10GB",
    }
    executor.update_parameters(**slurm_kwargs)

    task = Task(cfg)
    job = executor.submit(task)
    submitit.helpers.monitor_jobs([job])

if __name__ == "__main__":
    sys.exit(main())
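
For reference, a launch might look something like this (launch_slurm.py is just a placeholder name for the file above, and the gpu partition name is made up; slurm.job_name, slurm.time_limit, slurm.cmd and the rest would live in conf/slurm_config.yaml):

# sketch: submit a 2-node job with 8 processes per node through the hydra/submitit launcher above
python launch_slurm.py slurm.partition=gpu slurm.n_nodes=2 slurm.n_processes=8
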
muellerzr commented 1 year ago

Great work @yuvalkirstain! We likely wouldn't want to use submitit, considering their last commit was 6 months ago, which doesn't inspire confidence. Do you know of any other SLURM management packages we should consider? Otherwise I'll look into some alternatives here once some time is available.

However, for the time being, this is definitely a way to use SLURM that we'll point users to :)

yuvalkirstain commented 1 year ago

@muellerzr the submitit package is maintained and many at FAIR use it. SLURM is not changing very frequently, so I would not worry about it :)

muellerzr commented 1 year ago

CC @sgugger

sgugger commented 1 year ago

Yes, I'm sure many at FAIR use it since it's a facebookincubator project. It remains that the last commit is 6 months old. I see an issue opened 6 months ago by some folks at PyTorch Lightning using this, which has not received a response since. All of these are big red flags for using this project as any kind of dependency.

muellerzr commented 1 year ago

Thanks to @lvwerra, here's a template script that can be used for running accelerate under SLURM:

#!/bin/bash
#SBATCH --job-name=XYZ
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1          # crucial - only 1 task (the launcher) per node!
#SBATCH --cpus-per-task=96
#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --partition=production-cluster
#SBATCH --output=~/logs/%x-%j.out

set -x -e

source ~/leandro/.bashrc

conda activate trl

echo "START TIME: $(date)"

# Training setup
GPUS_PER_NODE=8
# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
NNODES=$SLURM_NNODES
NODE_RANK=$SLURM_PROCID 
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

cd ~/git/my_project/

CMD=" \
    train.py \
    --model_name ... \
    --whatever_args_for_your_script ... \
    "

LAUNCHER="accelerate launch \
    --multi_gpu \
    --num_machines $NNODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --role $SLURMD_NODENAME: \
    --rdzv_conf rdzv_backend=c10d \
    --max_restarts 0 \
    --tee 3 \
"

# NOT SURE THE FOLLOWING ENV VARS ARE STRICTLY NEEDED (PROBABLY NOT)
export CUDA_HOME=/usr/local/cuda-11.6
export LD_PRELOAD=$CUDA_HOME/lib/libnccl.so
export LD_LIBRARY_PATH=$CUDA_HOME/efa/lib:$CUDA_HOME/lib:$CUDA_HOME/lib64:$LD_LIBRARY_PATH

SRUN_ARGS=" \
    --wait=60 \
    --kill-on-bad-exit=1 \
    "

# note: %x/%j are SBATCH filename patterns, not shell variables, so use the SLURM env vars for the tee log
clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER $CMD" 2>&1 | tee ~/logs/${SLURM_JOB_NAME}-${SLURM_JOB_ID}.txt

echo "END TIME: $(date)"
WeitaoVan commented 1 year ago

@muellerzr thanks for sharing your script. May I ask why I get "IndexError: list index out of range" when executing the command, as shown in the screenshot? I ran this to use 2 nodes and 8 GPUs per node.

[screenshot: IndexError traceback]

and the script I run with the sbatch command is

[screenshot: sbatch script]

surak commented 1 year ago

The slurm command mentioned by @muellerzr still needs a properly set $HOME/.cache/huggingface/accelerate/default_config.yaml, which is problematic when you don't know the IP addresses of any of the nodes, or even how many nodes you will use for this run.
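
The address part can at least be derived inside the job itself, as the template above does, e.g.:

# the node list is only known once the job starts, so the master can be resolved there
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

but the node count and machine rank are still baked into the yaml.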

yuvalkirstain commented 1 year ago

Check out my repo https://github.com/yuvalkirstain/PickScore - it supports multi-node training with accelerate on SLURM.

surak commented 1 year ago

@yuvalkirstain I see that you are using submitit - but in the end, it's all a generator for a slurm script, right?

I am failing to see what the problem is on my supercomputer here.

Is there a specific thing you have to set for it to work multi-node?

surak commented 1 year ago

Basically I have this:

The only thing that is different is that the IP-over-InfiniBand interface lives on the hostname with an added "i": if the hostname is "node1", the InfiniBand interface is "node1i".

#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4

# srun does not inherit cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"
export MASTER_PORT=7010
export GPUS_PER_NODE=4
export NNODES=$SLURM_JOB_NUM_NODES

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1

# handle timeouts
export NCCL_IB_TIMEOUT=20

# Make sure we are on the right directory
cd $MYPROJECT/src

# This loads modules and python packages
source sc_venv_template/activate.sh

export LOGLEVEL=INFO

# Run the demo
time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_PROCID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
    distrib.py'

This works with one node, but freezes with more.

surak commented 1 year ago

Another problem with the default_config.yaml file is that it assumes each compute node has its own copy, so the configuration will always be broken on a shared filesystem: multiple ranks read the same file, which can be correct for only one of the nodes.

muellerzr commented 1 year ago

@surak re: your last point, you could probably just write a collection of config yamls, one per node, store them in a single folder, and pass the right one in (using the --config_file arg)?

And to get your baseline you can do something like cp ~/.cache/huggingface/accelerate/default_config.yaml .
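
A rough sketch of that idea, assuming a shared configs/ folder and two nodes (the folder and file names are made up):

# copy the baseline config once per node rank, then pick the right one by rank at launch time
mkdir -p configs
for rank in 0 1; do
    cp ~/.cache/huggingface/accelerate/default_config.yaml configs/node_${rank}.yaml
    # then edit the per-node fields (e.g. machine_rank) in each copy
done

srun bash -c 'accelerate launch --config_file configs/node_${SLURM_NODEID}.yaml train.py'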

surak commented 1 year ago

Does one need a config file at all if one is using the settings as I've shown in the SLURM script above?

This is much more reproducible than having files that are not part of the submission affect the submission itself:

time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_PROCID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
    distrib.py'
muellerzr commented 1 year ago

Nope, you do not. That is also perfectly valid (and it's why the non-YAML option exists, for situations where we need to wrap or call it separately and a YAML would make it complicated).

surak commented 1 year ago

In that case, I am confused as to why I can't run it on my machine with multiple nodes. I opened an issue at https://github.com/huggingface/accelerate/issues/1489

PS: this machine is used with up to 3744 nodes * 4 GPUs/node with PyTorch DDP.

JiuhaiChen commented 2 months ago

@WeitaoVan I also hit the IndexError: list index out of range issue - have you solved it?

hubutui commented 6 days ago

Here is a SLURM sbatch script that works for this minGPT example. Just update sbatch_run.sh and run sbatch sbatch_run.sh.

#!/bin/bash
# account to use
#SBATCH --account=<your account>
# job name
#SBATCH --job-name=mingpt
# partition to use
#SBATCH --partition=<partition name>
# number of nodes to use
# we use 2 nodes for ddp training
#SBATCH --nodes=2
# number of tasks per node, set it to 1 here
# we only need to start one task per node, aka the train script
#SBATCH --ntasks-per-node=1
# number of gpus per node to use, we use 1 gpu/node here for demo
#SBATCH --gpus-per-node=1
# number of cpus per gpu to use
#SBATCH --cpus-per-gpu=6
# maximum time to run the job, set it to 10 minutes for demo
#SBATCH --time=00:10:00

# activate your conda environment here
source /path/to/anaconda3/etc/profile.d/conda.sh
conda activate <envname>

rm -vf gpt_snapshot.pt
# print some useful information
echo "ibstatus: $(ibstatus)"
echo "ibdev2netdev: $(ibdev2netdev)"
echo "rdma device: $(rdma link)"
export LOGLEVEL=INFO
# choose one node as the master node for ddp training
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# randomly choose a port between 30000 and 50000 for communication with the master node
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
echo MASTER_ADDR: $MASTER_ADDR
echo MASTER_PORT: $MASTER_PORT
# enable NCCL debug info if needed for debugging
export NCCL_DEBUG=INFO
echo "environment: $(env | grep NCCL)"
# enable IB native support or not
# export NCCL_IB_DISABLE=0
# which device to use for communication between nodes
# if NCCL_IB_DISABLE=0, set NCCL_IB_HCA to the device shown by `rdma link` if NCCL cannot find one automatically
# export NCCL_IB_HCA=
# if NCCL_IB_DISABLE=1, set NCCL_SOCKET_IFNAME to the device shown by `ibdev2netdev` or `ip link show` if NCCL cannot find one automatically
# export NCCL_SOCKET_IFNAME=
# export NCCL_TOPO_DUMP_FILE=topo.xml

# torchrun
# srun --label torchrun \
#     --nnodes $SLURM_NNODES \
#     --nproc_per_node $SLURM_GPUS_PER_NODE \
#     --rdzv_id $RANDOM \
#     --rdzv_backend c10d \
#     --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
#     ../main.py

# python -m torch.distributed.run
# srun --label python -m torch.distributed.run \
#     --nnodes $SLURM_NNODES \
#     --nproc_per_node $SLURM_GPUS_PER_NODE \
#     --rdzv_id $RANDOM \
#     --rdzv_backend c10d \
#     --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
#     ../main.py

# accelerate
num_processes=$((SLURM_NNODES * SLURM_GPUS_PER_NODE))
srun --label accelerate launch \
    --multi_gpu \
    --rdzv_backend c10d \
    --machine_rank $SLURM_NODEID \
    --num_processes $num_processes \
    --num_machines $SLURM_NNODES \
    --dynamo_backend no \
    --mixed_precision no \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    ../main.py

BTW, setting NCCL_DEBUG=INFO and NCCL_IB_DISABLE=1 might help if the experiment gets stuck; there might be something wrong with the IB setup.
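
i.e. something like this near the top of the sbatch script:

# print NCCL init and transport details to see where a hang happens
export NCCL_DEBUG=INFO
# fall back to plain TCP sockets instead of InfiniBand if the IB setup is suspect
export NCCL_IB_DISABLE=1
# and if NCCL picks the wrong interface, pin it explicitly (eth0 is just an example; see `ip link show`)
# export NCCL_SOCKET_IFNAME=eth0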