mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Any example script to run multi-node training for slurm? #1378

Open wavy-jung opened 1 month ago

wavy-jung commented 1 month ago

Hi, I'm trying to run multi-node training on Slurm nodes, but I'm not sure how to configure the composer launcher arguments and commands. Is there an example script for running training on Slurm nodes with composer?

dakinggg commented 1 month ago

We don't have a slurm example, but here are the environment variables that the composer launcher sets/requires: https://github.com/mosaicml/composer/blob/6d4628a1043d1f118dc38eb359ede5524e0a9aa0/composer/cli/launcher.py#L344-L352. It should just be the normal torch distributed env vars.

And here are the env vars that mcli sets for you: https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/environment.html#runtime-environment-variables
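For a rough idea of how those map onto Slurm, here is a minimal per-node launch sketch (not a tested recipe), assuming one srun task per node inside an sbatch allocation. Most flag names also appear later in this thread; --nproc (processes per node) is an assumption, so check composer --help on your install:

# Minimal sketch: one composer launcher per node, driven by Slurm variables.
GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))   # total ranks across all nodes
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

# srun runs this once per node; \$SLURM_NODEID is escaped so it expands
# inside each srun task (giving the per-node rank) rather than once at submit time.
srun --ntasks-per-node=1 bash -c "composer \
    --world_size $WORLD_SIZE \
    --node_rank \$SLURM_NODEID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc $GPUS_PER_NODE \
    scripts/train/train.py scripts/train/yamls/pretrain/llama3-8b.yaml"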

wavy-jung commented 1 month ago

Thanks for the help, @dakinggg! I configured the environment variables as described in the links you provided and ran the job. Below is the script I used for training:

#!/bin/bash
#SBATCH --job-name=wavy-llmfoundry-test
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=8G
#SBATCH --gres=gpu:8
#SBATCH --output=slurm-logs/%x-%j.out

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=19963
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
WORK_DIR="/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/llm-foundry"

export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=INFO

export RANK=$NNODES
export WORLD_SIZE=$WORLD_SIZE
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export LOCAL_WORLD_SIZE=$GPUS_PER_NODE
export NUM_NODES=$NNODES

export LAUNCHER="composer --world_size $WORLD_SIZE \
    --master_addr $MASTER_ADDR \
    --master_port 19963"

export CMD="$WORK_DIR/scripts/train/train.py \
    $WORK_DIR/scripts/train/yamls/pretrain/llama3-8b.yaml"

srun \
--container-image /mnt/datafs/ib-a100-cluster-a-pri/lmt/images/wavy-llm-foundry-v0.10.0.sqsh \
--container-mounts /mnt/datafs:/mnt/datafs \
--container-workdir $WORK_DIR \
--jobid $SLURM_JOBID \
bash -c "export NODE_RANK=$SLURM_PROCID && $LAUNCHER --node_rank $SLURM_PROCID $CMD \
    save_folder=/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/checkpoints/composer/llama3-8b-slurm"

However, the error below was thrown: [screenshot of the error output]

So I tried the torchrun launcher instead; it gets past the initialization stage but gets stuck in the tokenizer-building stage, as shown below:

# export LAUNCHER="composer --world_size $WORLD_SIZE \
#     --master_addr $MASTER_ADDR \
#     --master_port 19963"

export LAUNCHER="torchrun \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d "

[screenshot of the logs where the run is stuck at tokenizer building]

I would like to ask if there are any potential causes you can think of.

+) The YAML file used for training is very similar to the example files, so I assume the problem has nothing to do with the YAML.
+) The sqsh image used for training was built on the latest Docker image, with an extra layer running pip install -e ".[gpu]" to set up llm-foundry.

dakinggg commented 1 month ago

Ah, this looks like an issue with the shared filesystem (see https://github.com/mosaicml/llm-foundry/pull/1253#issuecomment-2164037723 for more discussion). I haven't quite finished fixing that yet.
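A possible stopgap in the meantime (not the fix from the PRs below, and the paths are placeholders) is to keep the Hugging Face caches on node-local disk rather than the shared filesystem:

# Stopgap sketch: use node-local disk for Hugging Face caches so tokenizer
# files aren't written/locked on the shared filesystem. Paths are illustrative.
export HF_HOME=/tmp/$USER/hf_cache
export HF_DATASETS_CACHE=$HF_HOME/datasets
mkdir -p "$HF_DATASETS_CACHE"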

dakinggg commented 1 month ago

Could you try this PR: https://github.com/mosaicml/llm-foundry/pull/1381? You may also need composer with this PR: https://github.com/mosaicml/composer/pull/3485.
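If it helps, one way to test unmerged PRs like those is to install from their head refs; a quick sketch (the local branch names are arbitrary):

# Sketch: install llm-foundry from the PR branch under test.
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
git fetch origin pull/1381/head:pr-1381
git checkout pr-1381
pip install -e ".[gpu]"
cd ..

# Same pattern for the composer PR.
git clone https://github.com/mosaicml/composer.git
cd composer
git fetch origin pull/3485/head:pr-3485
git checkout pr-3485
pip install -e .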

wavy-jung commented 1 month ago

@dakinggg Thanks! I'll try with those PRs

dmakhervaks commented 1 month ago

@dakinggg It seems that 1381 was reverted -> https://github.com/mosaicml/llm-foundry/commit/221d3e2bfa641d007b2c666dd0402d57de0593ff

I tried pulling the latest docker image (mosaicml/llm-foundry:2.3.1_cu121-e882658) but I am still getting this error when trying to run in a multi-node setting:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank7]: Last error:
[rank7]: socketStartConnect: Connect to 172.20.1.119<43385> failed : Software caused connection abort

Is this expected? Thanks in advance!

dakinggg commented 1 month ago

Yes, we will reapply it soon, but you can still try with that PR. The unhandled system error seems different, though, and suggests your distributed environment is not set up correctly.
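One quick way to sanity-check that is to have every node print the rendezvous settings it would use before launching; a rough sketch, reusing the variable names from the script above:

# Debugging sketch: each node prints its node rank and the master address/port.
# The escaped variables expand per srun task; the others expand once at submit time.
srun --ntasks-per-node=1 bash -c \
  "echo \$(hostname): node_rank=\$SLURM_NODEID master=$MASTER_ADDR:$MASTER_PORT world_size=$WORLD_SIZE"

# If sockets still fail to connect, pinning the interface NCCL should use is a
# common next step (the interface name is cluster-specific):
# export NCCL_SOCKET_IFNAME=eth0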