guyuchao closed this issue 3 weeks ago
When I start my multi-node job using
"srun torchrun --nnodes 2 --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora",
I encounter the following error. Do you know what causes this?
rank7: File "/opt/hpcaas/.mounts/fs-0f5a75f2b23e4bb75/yuchaogu/projects_llm/try_llm/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
rank7:   component_trace = _Fire(component, args, parsed_flag_args, context, name)
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
rank7:   component, remaining_args = _CallAndUpdateTrace(
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
rank7:   component = fn(*varargs, **kwargs)
rank7: File "/opt/hpcaas/.mounts/fs-0f5a75f2b23e4bb75/yuchaogu/projects_llm/try_llm/llama-recipes/src/llama_recipes/finetuning.py", line 268, in main
rank7:   results = train(
rank7: File "/opt/hpcaas/.mounts/fs-0f5a75f2b23e4bb75/yuchaogu/projects_llm/try_llm/llama-recipes/src/llama_recipes/utils/train_utils.py", line 151, in train
rank7:   loss = model(batch).loss
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank7:   return self._call_impl(*args, **kwargs)
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank7:   return forward_call(*args, **kwargs)
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 843, in forward
rank7:   args, kwargs = _pre_forward(
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 380, in _pre_forward
rank7:   unshard_fn(state, handle)
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 415, in _pre_forward_unshard
rank7:   _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 299, in _unshard
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1308, in unshard
rank7:   padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1399, in _all_gather_flat_param
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
rank7:   return func(*args, **kwargs)
rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor
rank7:   work = group._allgather_base(output_tensor, input_tensor, opts)
rank7: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608839953/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
rank7: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
rank7: Last error:
rank7: NET/OFI Unable to register memory (type = 2) for device 1. RC: -22, Error: Invalid argument
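The error message itself points at the next diagnostic step ("run with NCCL_DEBUG=INFO for details"). A minimal sketch of the verbose-logging environment variables one could export before re-running the job; these are standard NCCL and libfabric knobs, not project-specific settings:

```shell
# Sketch: enable verbose logging to pinpoint the EFA memory-registration
# failure. These are standard NCCL / libfabric environment variables.
export NCCL_DEBUG=INFO              # detailed NCCL init and transport logs
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus the output on network setup
export FI_LOG_LEVEL=warn            # libfabric (EFA provider) diagnostics
```

With these set, the NCCL and NET/OFI lines in the job output usually show which interface and provider were selected before the registration failed.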
@guyuchao, we should use "srun torchrun --nnodes ...". It's been missing from the script, so please feel free to send a PR. Ideally it should look similar to this:
#SBATCH --job-name=...
#SBATCH --ntasks=4
#SBATCH --nodes=4
#SBATCH --gpus-per-task=8
#SBATCH --cpus-per-task=96
#SBATCH --partition=train
nodes_array=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo "Node IP: $head_node_ip"
export LOGLEVEL=INFO
# Enable for A100
export FI_PROVIDER="efa"
# Ensure that P2P is available
# export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# debugging flags (optional)
export NCCL_DEBUG=WARN
export PYTHONFAULTHANDLER=1
# optional debug settings
# export NCCL_DEBUG=INFO
# NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0
# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
export NCCL_BUFFSIZE=2097152
#export TORCH_DIST_INIT_BARRIER=1
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
#export USE_LIBUV=1
dcgmi profile --pause
# adjust sbatch --ntasks and sbatch --nodes above and --nnodes below
# to your specific node count, and update target launch file.
srun torchrun --nnodes 4 --nproc_per_node 8 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" ./finetuning.py --enable_fsdp --use_peft --peft_method lora
dcgmi profile --resume
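As a side note, the hard-coded --nnodes 4 can also be derived from Slurm's environment, so the launch command stays in sync with the #SBATCH --nodes value. A small sketch (SLURM_NNODES and SLURM_GPUS_PER_TASK are set by Slurm inside an allocation; the fallbacks here only let the snippet run standalone and mirror the example above):

```shell
# Sketch: take the node and GPU counts from Slurm instead of hard-coding
# them. Inside a job, Slurm sets these variables to match the #SBATCH
# directives; the defaults below are for illustration only.
NNODES=${SLURM_NNODES:-4}
GPUS_PER_NODE=${SLURM_GPUS_PER_TASK:-8}
echo "launching: torchrun --nnodes $NNODES --nproc_per_node $GPUS_PER_NODE ..."
```

This way, changing #SBATCH --nodes does not require editing the torchrun line as well.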
That perfectly solved my issue. Thanks.
In the given multi_node.slurm, why do you use "#SBATCH --nodes=2" but not specify --nnodes in "srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora"?
Is that a bug, or is there a specific reason for it?