huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

[DeepSpeed + Slurm + Accelerate] Rendezvous timeout error, questions about correct setup #2816

Closed: jubueche closed this issue 3 months ago

jubueche commented 4 months ago

System Info

- `Accelerate` version: 0.30.1
- Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
- `accelerate` bash location: /gpfs/u/home/ANFM/ANFMbchl/scratch/miniconda3/envs/torch-nightly/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 755.37 GB
- GPU type: Tesla V100-SXM2-32GB
- `Accelerate` default config:
        Not found

I know my Linux kernel is old, but there's nothing I can do about that since I'm not the admin on the cluster.

Reproduction

I want to do multi-node training with 2 nodes and 8 V100s per node. For that, I am using accelerate launch, Slurm, and DeepSpeed. This is the batch file that I submit with sbatch ...:

#!/bin/bash
#SBATCH --output=/gpfs/u/home/ANFM/ANFMbchl/scratch-shared/anfm/phi3/pretrain/revolving_manchester/%j.out
#SBATCH --error=/gpfs/u/home/ANFM/ANFMbchl/scratch-shared/anfm/phi3/pretrain/revolving_manchester/%j.err
#SBATCH --time=10
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --partition=npl-2024

#SBATCH --gres=gpu:8

nvidia-smi

echo "Running on node: $(hostname)"
echo "In directory:    $(pwd)"
echo "Starting on:     $(date)"
echo "SLURM_JOB_ID:    ${SLURM_JOB_ID}"

module load gcc/9.3.0/1
export WANDB_CACHE_DIR=$HOME/scratch/.cache
export WANDB_DATA_DIR=$HOME/scratch/.cache
export WANDB_DIR=$HOME/scratch/.cache
export WANDB_CONFIG_DIR=$HOME/scratch/.cache
export TMPDIR=$HOME/scratch/.cache
export MKL_SERVICE_FORCE_INTEL=1
export MAX_JOBS=8
export TRITON_CACHE_DIR=$HOME/scratch/.cache

export NCCL_DEBUG=DEBUG
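# Derive a job-specific rendezvous port from the last four digits of the job ID,
# and use the first node in the Slurm allocation as the master address.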
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE
echo "MASTER_ADDR="$MASTER_ADDR
echo "MASTER_PORT="$MASTER_PORT

accelerate launch \
 --multi_gpu \
 --dynamo_backend no \
 --mixed_precision fp16 \
 --num_processes 16 \
 --gpu_ids all \
 --num_machines 2 \
 --machine_rank $SLURM_NODEID \
 --rdzv_backend c10d \
 --same_network \
 --deepspeed_config_file /gpfs/u/home/ANFM/ANFMbchl/scratch-shared/anfm/phi3/pretrain/revolving_manchester/ds_config.json \
 train.py --config /gpfs/u/home/ANFM/ANFMbchl/scratch-shared/anfm/phi3/pretrain/revolving_manchester/config.yaml

echo "Finished at:      $(date)"

A few notes and questions:

What I have tried so far: I wanted to try the nlp_example.py from the Accelerate examples. I allocated two nodes and SSH'ed into both of them. I then created two accelerate configs, one for each node. This is the one for rank 0:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_hostfile: /gpfs/u/home/ANFM/ANFMbchl/scratch/phi3-anfm/ds_hostfile
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
  comms_logger:
    enabled: true
    verbose: true
    prof_all: true
    debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: npl02
main_process_port: 6000
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

and the one for rank 1 is the same, except that it has machine_rank: 1. I then launch the script manually on each node using accelerate launch --config_file <path to config> (a rough sketch is given after the evaluation snippet below). That seems to work (I think): both processes start the script. Rank 0 prints the training loss and accuracy, while rank 1 doesn't print anything, but then tries to run the evaluation here:

for step, batch in enumerate(eval_dataloader):
    # We could avoid this line since we set the accelerator with `device_placement=True`.
    batch.to(accelerator.device)
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    predictions, references = accelerator.gather_for_metrics((predictions, batch["labels"]))
    metric.add_batch(
        predictions=predictions,
        references=references,
    )

eval_metric = metric.compute()  # <----- HERE

but runs into a race condition (it seems to look for a lock file that isn't there). Should I see the losses printed from both ranks? If so, that doesn't work either. Is it correct that I launched both scripts manually? I read somewhere that you need to do that when you use deepspeed_multinode_launcher: standard.
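To be concrete, a minimal sketch of the manual launch described above, assuming the two config files are saved as accelerate_config_rank0.yaml and accelerate_config_rank1.yaml (the file names are illustrative, not from the thread):

# On node 0 (npl02), with the rank-0 config shown above:
accelerate launch --config_file accelerate_config_rank0.yaml nlp_example.py

# On node 1, with the otherwise identical config that has machine_rank: 1:
accelerate launch --config_file accelerate_config_rank1.yaml nlp_example.py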

I have a lot of questions and few answers 🥲

Expected behavior

I would like to make multi-node training work.

jubueche commented 3 months ago

Ok, I finally resolved my issues. The problem was in how I launched the workers: I forgot to use srun, so I was just starting my script on the main node. Also, in the bash file used for launching the command, you have to use \$SLURM_NODEID for the machine rank, not $SLURM_NODEID. For me, $SLURM_NODEID always evaluated to 0, so every instance was launched thinking it was the master process. That is why I got a timeout. I think better documentation or a tutorial would make sense here.
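For reference, a minimal sketch of the corrected sbatch file based on the description above. Paths are shortened (ds_config.json, train.py, and config.yaml stand in for the full paths), some #SBATCH directives are omitted, and passing --main_process_ip/--main_process_port explicitly is an addition not mentioned in the fix, so treat this as a sketch rather than a verified recipe:

#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --gres=gpu:8

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))

# Build the launch command as a string so that \$SLURM_NODEID is expanded on each
# node (where it is 0 or 1) when srun runs it, not on the submission node (where
# it would always be 0).
export LAUNCHER="accelerate launch \
 --multi_gpu \
 --dynamo_backend no \
 --mixed_precision fp16 \
 --num_processes 16 \
 --gpu_ids all \
 --num_machines 2 \
 --machine_rank \$SLURM_NODEID \
 --main_process_ip $MASTER_ADDR \
 --main_process_port $MASTER_PORT \
 --rdzv_backend c10d \
 --same_network \
 --deepspeed_config_file ds_config.json \
 train.py --config config.yaml"

# srun starts one task per node (--ntasks-per-node=1), so accelerate launch runs
# once on each of the two nodes with the correct machine rank.
srun bash -c "$LAUNCHER"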

muellerzr commented 3 months ago

Would you like to adjust the documentation in the examples/ folder @jubueche? 🤗

jubueche commented 3 months ago

I don't have time right now, but when I find some (maybe on the weekend) I can look into it.

juvi21 commented 3 months ago

Great! @jubueche May I ask for some clarification? I'm encountering very similar problems to those you described, and I'm using the same DeepSpeed ZeRO config as you. Currently, I'm running into the following error:

W0635 14:04:15.501000 139665687570304 torch/distributed/run.py:749] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.

This is the same error you mentioned before. Here is my finetune.slurm:

#!/bin/bash
#SBATCH --output=/home/username/output/%j.out
#SBATCH --error=/home/username/error/%j.err
#SBATCH --time=2-00:00:00  # Set the time limit to 2 days
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --partition=h100x4
#SBATCH --gres=gpu:4

# Load necessary modules or environment variables
# module load gcc/9.3.0/1
source /home/username/env/bin/activate

# Change to project directory
cd /home/username/project

# Display GPU status
nvidia-smi
echo "Running on node: $(hostname)"
echo "In directory:    $(pwd)"
echo "Starting on:     $(date)"
echo "SLURM_JOB_ID:    ${SLURM_JOB_ID}"

export NCCL_DEBUG=DEBUG
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_IP=$(ping -c 1 $MASTER_ADDR | grep PING | awk '{print $3}' | tr -d '()')
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

# Debugging: Check network configuration

echo "WORLD_SIZE="$WORLD_SIZE
echo "MASTER_ADDR="$MASTER_ADDR
echo "MASTER_PORT="$MASTER_PORT
echo "MASTER_IP="$MASTER_IP
ping -c 3 $MASTER_IP  # Ensure the MASTER_IP is reachable

# Run training with DeepSpeed integration
accelerate launch \
 --multi_gpu \
 --dynamo_backend no \
 --mixed_precision fp16 \
 --num_processes $(($SLURM_NTASKS_PER_NODE * $SLURM_NNODES)) \
 --gpu_ids all \
 --num_machines 2 \
 --machine_rank $SLURM_NODEID \
 --rdzv_backend c10d \
 --same_network \
 --deepspeed_config_file /home/username/project/deepspeed_configs/zero2-la.json \
 -m project.cli.train /home/username/project/config.yaml

echo "Finished at:      $(date)"

I'm currently using sbatch finetune.slurm to launch. Could you specify how exactly you executed this script with srun?

\$SLURM_NODEID instead of '$SLURM_NODEID' does not work in my case. It just throws an error, since the variable is not expanded and the raw string is passed instead.
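A minimal illustration of the escaping behaviour discussed above (the echo command is hypothetical, just to show the mechanics):

# \$ keeps the variable unexpanded inside the string on the submission node.
CMD="echo machine rank: \$SLURM_NODEID"
# Each srun task runs bash -c on its own node, where SLURM_NODEID is 0 or 1.
srun bash -c "$CMD"

# Writing \$SLURM_NODEID directly on the accelerate launch line, without wrapping
# the command in a string and running it through srun and bash -c, passes the
# literal text $SLURM_NODEID to accelerate, which is why it throws an error.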