microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

ModuleNotFoundError with Multi-node training using SLURM #3489

Open macabdul9 opened 1 year ago

macabdul9 commented 1 year ago

I am trying to train models on multiple nodes with SLURM as a workload manager. The issue seems to be that the Python virtual environment is not available on all nodes. Please find more details below.

Job script:

#!/bin/bash
#SBATCH --time=10:00
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
#SBATCH --mem=0

export NPROC_PER_NODE=4
export OUTPUT_DIR=./output/

export NCCL_DEBUG=INFO
export HDF5_USE_FILE_LOCKING='FALSE'
export PARENT=`/bin/hostname -s`
export MPORT=13001
export CHILDREN=`scontrol show hostnames $SLURM_JOB_NODELIST | grep -v $PARENT`
export HOSTLIST="$PARENT $CHILDREN"
echo $HOSTLIST
export WORLD_SIZE=$SLURM_NTASKS

module load gcc arrow python/3.8.10 ffmpeg/4.3.2 cuda
source ~/venv/bin/activate

srun distributed_runner_ds.sh
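
Before srun launches the training script, a quick per-node sanity check like the one below (a minimal sketch; pyarrow is simply the first import that fails for me) shows whether every allocated node resolves the same interpreter and can import the package:

# Sketch: run one task per node and report which Python it resolves
# and whether the failing import works there.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 bash -c '
  echo "== $(hostname -s) =="
  which python
  python -c "import pyarrow; print(pyarrow.__version__)" || echo "pyarrow NOT importable here"
'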

Training script (distributed_runner_ds.sh):

#!/bin/bash
/bin/hostname -s
export NCCL_BLOCKING_WAIT=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0

# replaces the content of hostfile every time
function makehostfile() {
perl -e '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"};
$slots=4 if $slots==0; # fallback for the 4-GPU nodes used here
@nodes = split /\n/, qx[scontrol show hostnames $ENV{"SLURM_JOB_NODELIST"}];
print map { "$_ slots=$slots\n" } @nodes'
}
makehostfile > hostfile

deepspeed --num_gpus=$(($NPROC_PER_NODE * $SLURM_JOB_NUM_NODES)) --num_nodes=$SLURM_JOB_NUM_NODES  --master_addr="$PARENT" --master_port="$MPORT" --hostfile hostfile train.py \
    --model_name_or_path "EleutherAI/gpt-j-6b" \
    --data_path mbzuai-distil/instruction \
    --output_dir ./output/ \
    --cache_dir ./cache \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing \
    --report_to="none" \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 100 \
    --deepspeed "ds_config2.json" \
    --debugging True

Hostfile:

ng30905 slots=4
ng31103 slots=4
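
(The hostfile above is produced by the makehostfile helper in the training script; an equivalent plain-bash version, shown only as a sketch with the slot count hard-coded to the 4 GPUs per node used here, would be:)

# Sketch: build the DeepSpeed hostfile from the SLURM node list,
# assuming 4 GPU slots on every node.
scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/$/ slots=4/' > hostfile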

Logs:

nohup: ignoring input
ng30905 ng31103
ng30905
ng31103
Num of node, 2
Num of GPU per node, 4
PROCID: 0
LOCALID: 0
Num of node, 2
Num of GPU per node, 4
PROCID: 1
LOCALID: 0
[2023-05-03 10:46:21,278] [INFO] [multinode_runner.py:67:get_cmd] Running on the following workers: ng30905,ng31103
[2023-05-03 10:46:21,279] [INFO] [runner.py:550:main] cmd = pdsh -S -f 1024 -w ng30905,ng31103 export NCCL_BLOCKING_WAIT=1; export NCCL_IB_DISABLE=1; export PYTHONPATH=/lustre07/scratch/awaheed/InstructTuning:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages; export NCCL_DEBUG=INFO; export NCCL_SOCKET_IFNAME=eth0;  cd /lustre07/scratch/awaheed/InstructTuning; /home/awaheed/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJuZzMwOTA1IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAibmczMTEwMyI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=ng30905 --master_port=29500 train.py --model_name_or_path 'EleutherAI/gpt-j-6b' --data_path 'mbzuai-distil/instruction' --output_dir './output/' --cache_dir './cache' --num_train_epochs '5' --per_device_train_batch_size '8' --per_device_eval_batch_size '8' --gradient_accumulation_steps '4' --gradient_checkpointing --report_to=none --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '1000' --learning_rate '2e-5' --weight_decay '0.' --warmup_ratio '0.03' --lr_scheduler_type 'cosine' --logging_steps '100' --deepspeed 'ds_config2.json' --debugging 'True'
[2023-05-03 10:46:21,466] [INFO] [multinode_runner.py:67:get_cmd] Running on the following workers: ng30905,ng31103
[2023-05-03 10:46:21,467] [INFO] [runner.py:550:main] cmd = pdsh -S -f 1024 -w ng30905,ng31103 export NCCL_BLOCKING_WAIT=1; export NCCL_IB_DISABLE=1; export PYTHONPATH=/lustre07/scratch/awaheed/InstructTuning:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages; export NCCL_DEBUG=INFO; export NCCL_SOCKET_IFNAME=eth0;  cd /lustre07/scratch/awaheed/InstructTuning; /home/awaheed/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJuZzMwOTA1IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAibmczMTEwMyI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=ng30905 --master_port=29500 train.py --model_name_or_path 'EleutherAI/gpt-j-6b' --data_path 'mbzuai-distil/instruction' --output_dir './output/' --cache_dir './cache' --num_train_epochs '5' --per_device_train_batch_size '8' --per_device_eval_batch_size '8' --gradient_accumulation_steps '4' --gradient_checkpointing --report_to=none --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '1000' --learning_rate '2e-5' --weight_decay '0.' --warmup_ratio '0.03' --lr_scheduler_type 'cosine' --logging_steps '100' --deepspeed 'ds_config2.json' --debugging 'True'
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_BLOCKING_WAIT=1
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_IB_DISABLE=1
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_DEBUG=INFO
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_SOCKET_IFNAME=eth0
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=0
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:162:main] dist_world_size=16
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_BLOCKING_WAIT=1
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_IB_DISABLE=1
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_DEBUG=INFO
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_SOCKET_IFNAME=eth0
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=1
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:162:main] dist_world_size=16
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_BLOCKING_WAIT=1
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_IB_DISABLE=1
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_DEBUG=INFO
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_SOCKET_IFNAME=eth0
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=0
ng30905: [2023-05-03 10:46:23,987] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng30905: [2023-05-03 10:46:23,987] [INFO] [launch.py:162:main] dist_world_size=16
ng30905: [2023-05-03 10:46:23,987] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_BLOCKING_WAIT=1
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_IB_DISABLE=1
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_DEBUG=INFO
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_SOCKET_IFNAME=eth0
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=1
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:162:main] dist_world_size=16
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng30905: Traceback (most recent call last):
ng30905:   File "/home/awaheed/venv/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1146, in _get_module
ng30905:     return importlib.import_module("." + module_name, self.__name__)
ng30905:   File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/importlib/__init__.py", line 127, in import_module
ng30905:     return _bootstrap._gcd_import(name[level:], package, level)
ng30905:   File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
ng30905:   File "<frozen importlib._bootstrap>", line 991, in _find_and_load
ng30905:   File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
ng30905:   File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
ng30905:   File "<frozen importlib._bootstrap_external>", line 848, in exec_module
ng30905:   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ng30905:   File "/home/awaheed/venv/lib/python3.8/site-packages/transformers/trainer.py", line 176, in <module>
ng30905:     import datasets
ng30905:   File "/home/awaheed/venv/lib/python3.8/site-packages/datasets/__init__.py", line 24, in <module>
ng30905:     import pyarrow
ng30905: ModuleNotFoundError: No module named 'pyarrow'
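
(Side note: the --world_info value in the pdsh command above is just base64-encoded JSON, so it can be decoded to see exactly which GPU indices the launcher assumes on each node; here it assumes 8 per node, which is why the log reports dist_world_size=16 even though the hostfile lists slots=4.)

# Sketch: decode the launcher's world_info blob; it matches the
# WORLD INFO DICT printed by launch.py.
echo 'eyJuZzMwOTA1IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAibmczMTEwMyI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0=' | base64 -d
# -> {"ng30905": [0, 1, 2, 3, 4, 5, 6, 7], "ng31103": [0, 1, 2, 3, 4, 5, 6, 7]}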

More Details:

CC: @loadams @tjruwase @RezaYazdaniAminabadi @HeyangQin @jeffra @ShadenSmith @samyam @molly-smith @arashashari @arashb Help is much appreciated. Thanks.

Sparklexs commented 1 year ago

I'm facing a similar situation to yours. I tried to fine-tune ChatGLM (a Chinese LLM) via DeepSpeed inside Slurm, using only one node with 4 GPUs (sbatch --gpus=4 xxx.sh). DeepSpeed appears to invoke main() in the Python script 4 times in parallel, and I always get a FileNotFoundError when initializing the tokenizer and model from the cache files, which are indeed there. I believe the error is caused by a conflict between the concurrent processes: when I set CUDA_VISIBLE_DEVICES to a single device, all the FileNotFoundErrors disappear (leaving only an OOM error), and with 2 GPUs the FileNotFoundError appears only sometimes, depending on the conflict. Here is the environment:

Here is the error I encountered: (image) Here is the Slurm script: (image)
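
One workaround I'm considering (just a guess, since the errors go away with a single process): populate the Hugging Face cache once in a single process, then run the multi-GPU launch in offline mode so no rank touches the hub concurrently. The model id below is only a placeholder for the actual checkpoint.

# Sketch: warm the cache single-process, then launch with the hub disabled
# so every rank only reads the local cache.
python -c "
from transformers import AutoTokenizer, AutoModel
AutoTokenizer.from_pretrained('THUDM/chatglm-6b', trust_remote_code=True)  # placeholder id
AutoModel.from_pretrained('THUDM/chatglm-6b', trust_remote_code=True)      # placeholder id
"
export TRANSFORMERS_OFFLINE=1   # all ranks read only from the local cache
# ...then launch deepspeed exactly as before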

macabdul9 commented 1 year ago

For me, multi-GPU training on a single node works fine. I only get this error when I try to train on multiple nodes, where not all of the nodes seem to have access to the correct virtual environment. @loadams @tjruwase @RezaYazdaniAminabadi @HeyangQin
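
Since the DeepSpeed runner reaches the workers over pdsh (visible in the cmd lines in the logs above), a quick way to confirm this is to ask each worker directly whether it can import the failing module, e.g. (a sketch using the two node names and interpreter path from my logs):

# Sketch: over the same pdsh path the launcher uses, report each worker's
# interpreter and whether pyarrow is importable there.
pdsh -w ng30905,ng31103 '/home/awaheed/venv/bin/python -c "import sys, pyarrow; print(sys.executable, pyarrow.__version__)"'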

macabdul9 commented 1 year ago

Please help @jeffra @ShadenSmith @samyam @molly-smith @arashashari @arashb