huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

Feature request - SLURM support #1239

yuvalkirstain commented 1 year ago

Hi, it would be really great if you could add SLURM support, or at least add a doc that shows how to run accelerate with multiple nodes on SLURM. I have seen several issues from people who want to use accelerate with SLURM. Thank you!

yuvalkirstain commented 1 year ago

Here is my solution - hope this can help accelerate support SLURM :)

It requires submitit (I also use hydra, but you can switch to argparse or fire) and makes everything super easy to use. You simply add the command for your script and you are pretty much done :)

import os
import random
import sys
import hydra
import submitit
from omegaconf import DictConfig
from trainer.accelerators.utils import nvidia_smi_gpu_memory_stats

def print_env():
    for key in sorted(os.environ.keys()):
        if not (
                key.startswith(("SLURM_", "SUBMITIT_"))
                or key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK", "LOCAL_WORLD_SIZE")
        ):
            continue
        value = os.environ[key]
        print(f"{key}={value}")

class Task:

    def __init__(self, cfg: DictConfig):
        self.cfg = cfg

    def __call__(self):
        print("Running task on slurm")
        print("exporting PyTorch distributed environment variables")
        dist_env = submitit.helpers.TorchDistributedEnvironment()
        rng = random.Random(dist_env._job_env.job_id)
        dist_env.master_port = rng.randint(10000, 20000)
        dist_env = dist_env.export()
        os.environ.update(**{
            "CUDA_LAUNCH_BLOCKING": "1",
            "NCCL_DEBUG": "info",
            "CUDA_VISIBLE_DEVICES": os.environ["SLURM_JOB_GPUS"],
        })
        print(nvidia_smi_gpu_memory_stats())
        print(f"master: {dist_env.master_addr}:{dist_env.master_port}")
        print(f"rank: {dist_env.rank}")
        print(f"world size: {dist_env.world_size}")
        print(f"local rank: {dist_env.local_rank}")
        print(f"local world size: {dist_env.local_world_size}")
        print("Running training script")
        print(f"Local rank {dist_env.local_rank}: {os.environ['CUDA_VISIBLE_DEVICES']=}")
        num_processes = self.cfg.slurm.n_processes * self.cfg.slurm.n_nodes
        machine_rank = dist_env.rank // self.cfg.slurm.n_processes
        cmd = f"accelerate launch --dynamo_backend no --num_processes {num_processes} --num_machines {self.cfg.slurm.n_nodes} --use_deepspeed --machine_rank {machine_rank} --main_process_ip {dist_env.master_addr} --main_process_port {dist_env.master_port} trainer/scripts/train.py {self.cfg.slurm.cmd}"
        print(f"Running command: {cmd}")
        print_env()
        if dist_env.local_rank == 0:
            os.system(cmd)
        else:
            print("Waiting for master to finish")

    def checkpoint(self):
        print("checkpointing")
        return submitit.helpers.DelayedSubmission(self)

@hydra.main(version_base=None, config_path="../conf", config_name="slurm_config")
def main(cfg: DictConfig) -> None:
    # import pydevd_pycharm
    # pydevd_pycharm.settrace('localhost', port=5900, stdoutToServer=True, stderrToServer=True)
    executor = submitit.AutoExecutor(folder="logs")
    print(cfg)
    slurm_kwargs = {
        "slurm_job_name": cfg.slurm.job_name,
        "slurm_partition": cfg.slurm.partition,
        "slurm_nodes": cfg.slurm.n_nodes,
        "slurm_additional_parameters": {
            "gpus": cfg.slurm.n_processes,
            "ntasks_per_node": cfg.slurm.n_processes,
        },
        "slurm_cpus_per_task": 12,
        "slurm_time": cfg.slurm.time_limit,
        "slurm_exclude": cfg.slurm.exclude if cfg.slurm.exclude else "",
        "stderr_to_stdout": True,
        "slurm_mem": "10GB",
    }
    executor.update_parameters(**slurm_kwargs)

    task = Task(cfg)
    job = executor.submit(task)
    submitit.helpers.monitor_jobs([job])

if __name__ == "__main__":
    sys.exit(main())
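
For reference, a launch might look something like this (launch_slurm.py is just a placeholder name for the file above, and the gpu partition name is made up; slurm.job_name, slurm.time_limit, slurm.cmd and the rest would live in conf/slurm_config.yaml):

# sketch: submit a 2-node job with 8 processes per node through the hydra/submitit launcher above
python launch_slurm.py slurm.partition=gpu slurm.n_nodes=2 slurm.n_processes=8
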
muellerzr commented 1 year ago

Great work @yuvalkirstain! We likely wouldn't want to use submitit, considering their last commit was 6 months ago, which doesn't inspire confidence. Do you know of any other SLURM management packages we should consider? Otherwise I'll look into some alternatives here once some time is available.

However, for the time being, this is definitely a way to use SLURM that we'll point users to :)

yuvalkirstain commented 1 year ago

@muellerzr the submitit package is maintained and many at FAIR use it. SLURM is not changing very frequently, so I would not worry about it :)

muellerzr commented 1 year ago

CC @sgugger

sgugger commented 1 year ago

Yes, I'm sure many at FAIR use it since it's a facebookincubator project. It remains that the last commit is 6 months old. I see an issue opened 6 months ago by some folks at PyTorch Lightning using this, which has not received a response since. All of these are big red flags for using this project as any kind of dependency.

muellerzr commented 1 year ago

Thanks to @lvwerra, here's a template script that can be used for running accelerate under SLURM:

#!/bin/bash
#SBATCH --job-name=XYZ
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1          # crucial - only 1 task (the launcher) per node!
#SBATCH --cpus-per-task=96
#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --partition=production-cluster
#SBATCH --output=~/logs/%x-%j.out

set -x -e

source ~/leandro/.bashrc

conda activate trl

echo "START TIME: $(date)"

# Training setup
GPUS_PER_NODE=8
# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
NNODES=$SLURM_NNODES
NODE_RANK=$SLURM_PROCID 
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

cd ~/git/my_project/

CMD=" \
    train.py \
    --model_name ... \
    --whatever_args_for_your_script ... \
    "

LAUNCHER="accelerate launch \
    --multi_gpu \
    --num_machines $NNODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --role $SLURMD_NODENAME: \
    --rdzv_conf rdzv_backend=c10d \
    --max_restarts 0 \
    --tee 3 \
"

# NOT SURE THE FOLLOWING ENV VARS ARE STRICTLY NEEDED (PROBABLY NOT)
export CUDA_HOME=/usr/local/cuda-11.6
export LD_PRELOAD=$CUDA_HOME/lib/libnccl.so
export LD_LIBRARY_PATH=$CUDA_HOME/efa/lib:$CUDA_HOME/lib:$CUDA_HOME/lib64:$LD_LIBRARY_PATH

SRUN_ARGS=" \
    --wait=60 \
    --kill-on-bad-exit=1 \
    "

# note: %x/%j are SBATCH filename patterns, not shell variables, so use the SLURM env vars for the tee log
clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER $CMD" 2>&1 | tee ~/logs/${SLURM_JOB_NAME}-${SLURM_JOB_ID}.txt

echo "END TIME: $(date)"
WeitaoVan commented 1 year ago

@muellerzr thanks for sharing your script. May I ask why I get "IndexError: list index out of range" when executing the command, as shown in the screenshot? I ran this to use 2 nodes and 8 GPUs per node.

[screenshot: IndexError traceback]

and the script I run with the sbatch command is

[screenshot: sbatch script]

surak commented 1 year ago

The slurm command mentioned by @muellerzr still needs a properly set $HOME/.cache/huggingface/accelerate/default_config.yaml, which is problematic when you don't know the IP addresses of any of the nodes, or even how many nodes you will use for this run.
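
The address part can at least be derived inside the job itself, as the template above does, e.g.:

# the node list is only known once the job starts, so the master can be resolved there
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

but the node count and machine rank are still baked into the yaml.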

yuvalkirstain commented 1 year ago

Check out my repo https://github.com/yuvalkirstain/PickScore - it supports multi-node training with accelerate on SLURM.

surak commented 1 year ago

@yuvalkirstain I see that you are using submitit - but in the end, it's all a generator for a slurm script, right?

I am failing to see what the problem is on my supercomputer here.

Is there a specific thing you have to set for it to work multi-node?

surak commented 1 year ago

Basically I have this:

The only thing that is different is that the IP-over-InfiniBand interface lives on the hostname with an added "i": if the hostname is "node1", the InfiniBand interface is "node1i".

#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4

# srun does not inherit cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"
export MASTER_PORT=7010
export GPUS_PER_NODE=4
export NNODES=$SLURM_JOB_NUM_NODES

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1

# handle timeouts
export NCCL_IB_TIMEOUT=20

# Make sure we are on the right directory
cd $MYPROJECT/src

# This loads modules and python packages
source sc_venv_template/activate.sh

export LOGLEVEL=INFO

# Run the demo
time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_PROCID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
    distrib.py'

This works with one node, but freezes with more.

surak commented 1 year ago

Another problem with the default_config.yaml file is that it assumes each compute node has its own copy, so the configuration will always be broken on a shared filesystem: multiple ranks read the same file, which can be correct for only one of the nodes.

muellerzr commented 1 year ago

@surak re: your last point, you could probably just write a collection of config yamls, one per node, store them in a single folder, and pass the right one in (using the --config_file arg)?

And to get your baseline you can do something like cp ~/.cache/huggingface/accelerate/default_config.yaml .
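
A rough sketch of that idea, assuming a shared configs/ folder and two nodes (the folder and file names are made up):

# copy the baseline config once per node rank, then pick the right one by rank at launch time
mkdir -p configs
for rank in 0 1; do
    cp ~/.cache/huggingface/accelerate/default_config.yaml configs/node_${rank}.yaml
    # then edit the per-node fields (e.g. machine_rank) in each copy
done

srun bash -c 'accelerate launch --config_file configs/node_${SLURM_NODEID}.yaml train.py'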

surak commented 1 year ago

Does one need a config file at all if one is using the settings as I've shown in the SLURM script above?

This is much more reproducible than having files that are not part of the submission affect the submission itself:

time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_PROCID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
    distrib.py'
muellerzr commented 1 year ago

Nope, you do not. That is also perfectly valid (and it's why the non-YAML option exists, for situations where we need to wrap or call it separately and a YAML would make it complicated).

surak commented 1 year ago

In that case, I am confused as to why I can't run it on my machine with multiple nodes. I opened an issue at https://github.com/huggingface/accelerate/issues/1489

PS: this machine is used with up to 3744 nodes * 4 GPUs/node with PyTorch DDP.

JiuhaiChen commented 2 months ago

@WeitaoVan I also hit the IndexError: list index out of range issue - have you solved it?

hubutui commented 6 days ago

Here is a SLURM sbatch script that works for this minGPT example. Just update sbatch_run.sh and run sbatch sbatch_run.sh.

#!/bin/bash
# account to use
#SBATCH --account=<your account>
# job name
#SBATCH --job-name=mingpt
# partition to use
#SBATCH --partition=<partition name>
# number of nodes to use
# we use 2 nodes for ddp training
#SBATCH --nodes=2
# number of tasks per node, set it to 1 here
# we only need to start one task per node, aka the train script
#SBATCH --ntasks-per-node=1
# number of gpus per node to use, we use 1 gpu/node here for demo
#SBATCH --gpus-per-node=1
# number of cpus per gpu to use
#SBATCH --cpus-per-gpu=6
# maximum time to run the job, set it to 10 minutes for demo
#SBATCH --time=00:10:00

# activate your conda environment here
source /path/to/anaconda3/etc/profile.d/conda.sh
conda activate <envname>

rm -vf gpt_snapshot.pt
# print some useful information
echo "ibstatus: $(ibstatus)"
echo "ibdev2netdev: $(ibdev2netdev)"
echo "rdma device: $(rdma link)"
export LOGLEVEL=INFO
# choose one node as the master node for ddp training
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# randomly choose a port between 30000 and 50000 for communication with the master node
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
echo MASTER_ADDR: $MASTER_ADDR
echo MASTER_PORT: $MASTER_PORT
# enable NCCL debug info if needed for debugging
export NCCL_DEBUG=INFO
echo "environment: $(env | grep NCCL)"
# enable IB native support or not
# export NCCL_IB_DISABLE=0
# which device to use for communication between nodes
# if NCCL_IB_DISABLE=0, set NCCL_IB_HCA to the device shown by `rdma link` if NCCL cannot find one automatically
# export NCCL_IB_HCA=
# if NCCL_IB_DISABLE=1, set NCCL_SOCKET_IFNAME to the device shown by `ibdev2netdev` or `ip link show` if NCCL cannot find one automatically
# export NCCL_SOCKET_IFNAME=
# export NCCL_TOPO_DUMP_FILE=topo.xml

# torchrun
# srun --label torchrun \
#     --nnodes $SLURM_NNODES \
#     --nproc_per_node $SLURM_GPUS_PER_NODE \
#     --rdzv_id $RANDOM \
#     --rdzv_backend c10d \
#     --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
#     ../main.py

# python -m torch.distributed.run
# srun --label python -m torch.distributed.run \
#     --nnodes $SLURM_NNODES \
#     --nproc_per_node $SLURM_GPUS_PER_NODE \
#     --rdzv_id $RANDOM \
#     --rdzv_backend c10d \
#     --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
#     ../main.py

# accelerate
num_processes=$((SLURM_NNODES * SLURM_GPUS_PER_NODE))
srun --label accelerate launch \
    --multi_gpu \
    --rdzv_backend c10d \
    --machine_rank $SLURM_NODEID \
    --num_processes $num_processes \
    --num_machines $SLURM_NNODES \
    --dynamo_backend no \
    --mixed_precision no \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    ../main.py

BTW, setting NCCL_DEBUG=INFO and NCCL_IB_DISABLE=1 might help if the experiment gets stuck; there might be something wrong with the IB setup.
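
i.e. something like this near the top of the sbatch script:

# print NCCL init and transport details to see where a hang happens
export NCCL_DEBUG=INFO
# fall back to plain TCP sockets instead of InfiniBand if the IB setup is suspect
export NCCL_IB_DISABLE=1
# and if NCCL picks the wrong interface, pin it explicitly (eth0 is just an example; see `ip link show`)
# export NCCL_SOCKET_IFNAME=eth0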