mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Any example script to run multi-node training for slurm? #1378

Open wavy-jung opened 1 month ago

wavy-jung commented 1 month ago

Hi, I'm trying to run multi-node training on Slurm nodes, but I'm not sure how to configure the composer launcher arguments and commands. Is there an example script for running training on Slurm nodes with composer?

dakinggg commented 1 month ago

We don't have a slurm example, but here are the environment variables that the composer launcher sets/requires: https://github.com/mosaicml/composer/blob/6d4628a1043d1f118dc38eb359ede5524e0a9aa0/composer/cli/launcher.py#L344-L352. It should just be the normal torch distributed env vars.

And here are the env vars that mcli sets for you: https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/environment.html#runtime-environment-variables
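For a rough idea of how those map onto Slurm, here is a minimal per-node launch sketch (not a tested recipe), assuming one srun task per node inside an sbatch allocation. Most flag names also appear later in this thread; --nproc (processes per node) is an assumption, so check composer --help on your install:

# Minimal sketch: one composer launcher per node, driven by Slurm variables.
GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))   # total ranks across all nodes
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

# srun runs this once per node; \$SLURM_NODEID is escaped so it expands
# inside each srun task (giving the per-node rank) rather than once at submit time.
srun --ntasks-per-node=1 bash -c "composer \
    --world_size $WORLD_SIZE \
    --node_rank \$SLURM_NODEID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc $GPUS_PER_NODE \
    scripts/train/train.py scripts/train/yamls/pretrain/llama3-8b.yaml"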

wavy-jung commented 1 month ago

Thanks for the help, @dakinggg! I configured the environment variables as described in the links you provided and ran the job. Below is the script I used for training:

#!/bin/bash
#SBATCH --job-name=wavy-llmfoundry-test
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=8G
#SBATCH --gres=gpu:8
#SBATCH --output=slurm-logs/%x-%j.out

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=19963
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
WORK_DIR="/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/llm-foundry"

export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=INFO

export RANK=$NNODES
export WORLD_SIZE=$WORLD_SIZE
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export LOCAL_WORLD_SIZE=$GPUS_PER_NODE
export NUM_NODES=$NNODES

export LAUNCHER="composer --world_size $WORLD_SIZE \
    --master_addr $MASTER_ADDR \
    --master_port 19963"

export CMD="$WORK_DIR/scripts/train/train.py \
    $WORK_DIR/scripts/train/yamls/pretrain/llama3-8b.yaml"

srun \
--container-image /mnt/datafs/ib-a100-cluster-a-pri/lmt/images/wavy-llm-foundry-v0.10.0.sqsh \
--container-mounts /mnt/datafs:/mnt/datafs \
--container-workdir $WORK_DIR \
--jobid $SLURM_JOBID \
bash -c "export NODE_RANK=$SLURM_PROCID && $LAUNCHER --node_rank $SLURM_PROCID $CMD \
    save_folder=/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/checkpoints/composer/llama3-8b-slurm"

However, the error below was thrown: [screenshot of the error output]

So I tried the torchrun launcher instead; it gets past the initialization stage but gets stuck in the tokenizer-building stage, as shown below:

# export LAUNCHER="composer --world_size $WORLD_SIZE \
#     --master_addr $MASTER_ADDR \
#     --master_port 19963"

export LAUNCHER="torchrun \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d "

[screenshot of the logs where the run is stuck at tokenizer building]

I would like to ask if there are any potential causes you can think of.

+) The YAML file used for training is very similar to the example files, so I assume the problem has nothing to do with the YAML.
+) The sqsh image used for training was built on the latest Docker image, with an extra layer running pip install -e ".[gpu]" to set up llm-foundry.

dakinggg commented 1 month ago

Ah, this looks like an issue with the shared filesystem (see https://github.com/mosaicml/llm-foundry/pull/1253#issuecomment-2164037723 for more discussion). I haven't quite finished fixing that yet.
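A possible stopgap in the meantime (not the fix from the PRs below, and the paths are placeholders) is to keep the Hugging Face caches on node-local disk rather than the shared filesystem:

# Stopgap sketch: use node-local disk for Hugging Face caches so tokenizer
# files aren't written/locked on the shared filesystem. Paths are illustrative.
export HF_HOME=/tmp/$USER/hf_cache
export HF_DATASETS_CACHE=$HF_HOME/datasets
mkdir -p "$HF_DATASETS_CACHE"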

dakinggg commented 1 month ago

Could you try this PR: https://github.com/mosaicml/llm-foundry/pull/1381? You may also need composer with this PR: https://github.com/mosaicml/composer/pull/3485.
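If it helps, one way to test unmerged PRs like those is to install from their head refs; a quick sketch (the local branch names are arbitrary):

# Sketch: install llm-foundry from the PR branch under test.
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
git fetch origin pull/1381/head:pr-1381
git checkout pr-1381
pip install -e ".[gpu]"
cd ..

# Same pattern for the composer PR.
git clone https://github.com/mosaicml/composer.git
cd composer
git fetch origin pull/3485/head:pr-3485
git checkout pr-3485
pip install -e .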

wavy-jung commented 1 month ago

@dakinggg Thanks! I'll try with those PRs

dmakhervaks commented 1 month ago

@dakinggg It seems that 1381 was reverted -> https://github.com/mosaicml/llm-foundry/commit/221d3e2bfa641d007b2c666dd0402d57de0593ff

I tried pulling the latest docker image (mosaicml/llm-foundry:2.3.1_cu121-e882658) but I am still getting this error when trying to run in a multi-node setting:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank7]: Last error:
[rank7]: socketStartConnect: Connect to 172.20.1.119<43385> failed : Software caused connection abort

Is this expected? Thanks in advance!

dakinggg commented 1 month ago

Yes, we will reapply it soon, but you can still try with that PR. The unhandled system error seems different, though, and suggests your distributed environment is not set up correctly.
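One quick way to sanity-check that is to have every node print the rendezvous settings it would use before launching; a rough sketch, reusing the variable names from the script above:

# Debugging sketch: each node prints its node rank and the master address/port.
# The escaped variables expand per srun task; the others expand once at submit time.
srun --ntasks-per-node=1 bash -c \
  "echo \$(hostname): node_rank=\$SLURM_NODEID master=$MASTER_ADDR:$MASTER_PORT world_size=$WORLD_SIZE"

# If sockets still fail to connect, pinning the interface NCCL should use is a
# common next step (the interface name is cluster-specific):
# export NCCL_SOCKET_IFNAME=eth0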