Closed: jubueche closed this issue 3 months ago
Ok, I resolved my issues finally. The problem was in how I launch the workers: I forgot to use srun, so I was just starting my script on the main node.
Also, in the bash file used for launching the command, you have to use \$SLURM_NODEID for the machine rank and not $SLURM_NODEID. For me, $SLURM_NODEID always resulted in 0, so every instance got launched thinking it was the master process. That is why I got a timeout.
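To make that concrete, here is a minimal sketch of the pattern (the flag values, entry point, and file names here are illustrative, not my exact files): the launch command is assembled as a string with SLURM_NODEID escaped, and srun then runs it once per node, where the variable picks up that node's value.
# Minimal sketch (illustrative values): build the command with \$SLURM_NODEID left
# unexpanded at submit time, then let srun start it once per node.
LAUNCHER="accelerate launch \
    --num_machines $SLURM_NNODES \
    --machine_rank \$SLURM_NODEID \
    --num_processes 16 \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    train.py"
# One task per node: each task's bash expands SLURM_NODEID to that node's index (0, 1, ...).
srun --ntasks-per-node=1 bash -c "$LAUNCHER"
This way node 0 launches with machine_rank 0 and node 1 with machine_rank 1, instead of every node believing it is rank 0 and then timing out waiting for the others.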
I think better documentation or a tutorial would make sense here.
Would you like to adjust the documentation in the examples/ folder @jubueche? 🤗
I don't have time right now, but when I find some (maybe on the weekend) I can look into it.
Great! @jubueche, may I ask for some clarification? I'm encountering very similar problems to those you described, and I use the same DeepSpeed ZeRO config as you. Currently, I'm running into the following error:
W0635 14:04:15.501000 139665687570304 torch/distributed/run.py:749] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
This is the same error you mentioned before. Here is my finetune.slurm:
#!/bin/bash
#SBATCH --output=/home/username/output/%j.out
#SBATCH --error=/home/username/error/%j.err
#SBATCH --time=2-00:00:00 # Set the time limit to 2 days
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --partition=h100x4
#SBATCH --gres=gpu:4
# Load necessary modules or environment variables
# module load gcc/9.3.0/1
source /home/username/env/bin/activate
# Change to project directory
cd /home/username/project
# Display GPU status
nvidia-smi
echo "Running on node: $(hostname)"
echo "In directory: $(pwd)"
echo "Starting on: $(date)"
echo "SLURM_JOB_ID: ${SLURM_JOB_ID}"
export NCCL_DEBUG=DEBUG
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_IP=$(ping -c 1 $MASTER_ADDR | grep PING | awk '{print $3}' | tr -d '()')
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
# Debugging: Check network configuration
echo "WORLD_SIZE="$WORLD_SIZE
echo "MASTER_ADDR="$MASTER_ADDR
echo "MASTER_PORT="$MASTER_PORT
echo "MASTER_IP="$MASTER_IP
ping -c 3 $MASTER_IP # Ensure the MASTER_IP is reachable
# Run training with DeepSpeed integration
accelerate launch \
    --multi_gpu \
    --dynamo_backend no \
    --mixed_precision fp16 \
    --num_processes $(($SLURM_NTASKS_PER_NODE * $SLURM_NNODES)) \
    --gpu_ids all \
    --num_machines 2 \
    --machine_rank $SLURM_NODEID \
    --rdzv_backend c10d \
    --same_network \
    --deepspeed_config_file /home/username/project/deepspeed_configs/zero2-la.json \
    -m project.cli.train /home/username/project/config.yaml
echo "Finished at: $(date)"
I'm currently using sbatch finetune.slurm to launch. Could you specify how exactly you executed this script with srun? Using \$SLURM_NODEID instead of $SLURM_NODEID does not work in my case; it just throws an error, since it doesn't read the variable but the raw string.
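If I understand the suggestion correctly, the escape only helps when the string gets re-evaluated on each node later; pasted directly onto the accelerate launch line in the sbatch script it stays a literal string. A small, purely hypothetical illustration of the three behaviours:
echo "rank: $SLURM_NODEID"                  # expanded by the batch script's shell: always the value on the first node (0)
srun bash -c 'echo "rank: $SLURM_NODEID"'   # expanded inside each srun task: 0 on node 0, 1 on node 1
echo "rank: \$SLURM_NODEID"                 # the backslash keeps it literal, so the program just receives the text
                                            # "$SLURM_NODEID", which is the "raw string" error described above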
System Info
I know my Linux kernel is old, but there is nothing I can do about that since I'm not the admin on the cluster.
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
I want to do multi-node training with 2 nodes and 8 V100s per node. For that, I am using accelerate launch, Slurm, and DeepSpeed. This is my batch file that gets executed using sbatch ..., which then submits the job.
A few notes and questions:
- export NCCL_DEBUG=DEBUG doesn't seem to do anything for me. I searched online but couldn't find an answer. (See the side note right after this list.)
- export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)) led to port 12511, which should be fine.
- MASTER_ADDR is set to npl19 in this case. I verified that I can ssh into npl19 without a password.
- --num_processes 16: this is correct, right? I am launching on 2 nodes with 8 GPUs each, so 16 in total.
- --machine_rank $SLURM_NODEID: when Slurm launches this, does it somehow pass different ranks on the different nodes, or does it just execute on the main node and pass 0? I'm not sure how it works internally.
- --rdzv_backend c10d: I got the warning W0531 14:04:15.501000 139665687570304 torch/distributed/run.py:749] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. That seems a bit weird to me. If I don't specify the master address, how do the other processes know where to send the data? I also didn't explicitly pass the master address; I just set the env variable.
- --same_network: how do I check whether this is the case for me? Is it sufficient to be able to ssh into another node? I guess not, right?
- Is there a global env var so that I can enable debugging logs for DeepSpeed? I could only find the comms_logger attribute, but it didn't do anything for me (maybe I haven't gotten to a point where it should print anything).
- I saw online that "deepspeed_multinode_launcher": "standard" is important to specify. The problem is that pdsh is not installed on my system, so I had to use standard. Is this a problem? How does this interplay with SLURM?
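Side note on the NCCL_DEBUG item above (generic NCCL/PyTorch behavior, not specific to this cluster): as far as I can tell, DEBUG is not one of the levels NCCL recognizes, which would explain why the variable appears to do nothing. Something like this should actually produce output:
# NCCL only understands VERSION, WARN, INFO, and TRACE for NCCL_DEBUG;
# an unrecognized value such as DEBUG results in no logging at all.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET        # optional: restrict output to init and networking
export TORCH_DISTRIBUTED_DEBUG=DETAIL    # extra logging from torch.distributed itself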
What I have tried so far: I wanted to try the nlp_example.py from the examples of transformers. I created two nodes and ssh'ed into both of them. I then created two accelerate configs, one for each node; the one for rank 1 is the same as the one for rank 0, except that it sets machine_rank: 1. I then launch both on each node using accelerate launch --config <path to configs>. And that seems to work (I think). Both processes start the script. Rank 0 prints the training loss and accuracy, and rank 1 doesn't do anything, but then tries to do the evaluation here and runs into a race condition (essentially, it looks for a lock file that is not there, I think). Should I see print statements of the losses from both ranks? If so, then that doesn't work either. Is it correct that I launched both scripts manually? I read somewhere that you need to do that when you use deepspeed_multinode_launcher: standard.
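For reference, a hypothetical sketch of what such a pair of configs could look like (the keys and values here are illustrative placeholders, including the DeepSpeed json path, and not my actual files); the rank-1 copy would differ only in machine_rank: 1:
# Hypothetical rank-0 accelerate config written as a heredoc (illustrative values only).
cat > accelerate_rank0.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: /path/to/zero2.json
  deepspeed_multinode_launcher: standard
machine_rank: 0
main_process_ip: npl19
main_process_port: 12511
num_machines: 2
num_processes: 16
mixed_precision: fp16
rdzv_backend: c10d
same_network: true
EOF
As far as I understand, with deepspeed_multinode_launcher: standard accelerate does not start anything on the other node for you, so running accelerate launch once per node (manually, or via srun in the Slurm setup) is expected.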
I have a lot of questions and few answers 🥲
Expected behavior
I would like to make multi-node training work.