meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama 3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default and custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama 3 for WhatsApp & Messenger.

Multi-Node Issue #491

Closed: guyuchao closed this issue 3 weeks ago

guyuchao commented 3 weeks ago

Why, in the given multi_node.slurm, do you use "#SBATCH --nodes=2" but not specify --nnodes in "srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora"?

Is this a bug, or is there a specific reason for it?
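
For context, a minimal side-by-side sketch of the two settings in question (this excerpt is illustrative, not a verbatim copy of the shipped multi_node.slurm):

#SBATCH --nodes=2    # Slurm allocates two nodes for the job

# torchrun is launched without --nnodes, so it falls back to its default of 1:1 (a single node)
srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora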

guyuchao commented 3 weeks ago

When I start my multi-node job using

"srun torchrun --nnodes 2 --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora".

I encounter the following error. Do you know what causes it?

rank7: File "/opt/hpcaas/.mounts/fs-0f5a75f2b23e4bb75/yuchaogu/projects_llm/try_llm/llama-recipes/recipes/finetuning/finetuning.py", line 8, in

rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/fire/core.py", line 143, in Fire rank7: component_trace = _Fire(component, args, parsed_flag_args, context, name) rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire rank7: component, remaining_args = _CallAndUpdateTrace( rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace rank7: component = fn(varargs, kwargs) rank7: File "/opt/hpcaas/.mounts/fs-0f5a75f2b23e4bb75/yuchaogu/projects_llm/try_llm/llama-recipes/src/llama_recipes/finetuning.py", line 268, in main rank7: results = train( rank7: File "/opt/hpcaas/.mounts/fs-0f5a75f2b23e4bb75/yuchaogu/projects_llm/try_llm/llama-recipes/src/llama_recipes/utils/train_utils.py", line 151, in train rank7: loss = model(batch).loss rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank7: return self._call_impl(args, *kwargs) rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank7: return forward_call(args, **kwargs) rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 843, in forward rank7: args, kwargs = _pre_forward( rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 380, in _pre_forward rank7: unshard_fn(state, handle) rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 415, in _pre_forward_unshard rank7: _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream) rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 299, in _unshard

rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1308, in unshard rank7: padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param) rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1399, in _all_gather_flat_param

rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper rank7: return func(*args, **kwargs) rank7: File "/data/home/yuchaogu/miniconda3/envs/llama_recipes2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor rank7: work = group._allgather_base(output_tensor, input_tensor, opts) rank7: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608839953/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5 rank7: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. rank7: Last error: rank7: NET/OFI Unable to register memory (type = 2) for device 1. RC: -22, Error: Invalid argument

HamidShojanazeri commented 3 weeks ago

@guyuchao, we should use srun torchrun --nnodes; it has been missing from the script, so please feel free to send a PR. Ideally it should look similar to this:

#SBATCH --job-name=...

#SBATCH --ntasks=4

#SBATCH --nodes=4

#SBATCH --gpus-per-task=8

#SBATCH --cpus-per-task=96

#SBATCH --partition=train

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# Enable for A100
export FI_PROVIDER="efa"
# Ensure that P2P is available
# export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# debugging flags (optional)
export NCCL_DEBUG=WARN
export PYTHONFAULTHANDLER=1
# optional debug settings
# export NCCL_DEBUG=INFO
# NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV

export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0

# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
export NCCL_BUFFSIZE=2097152
#export TORCH_DIST_INIT_BARRIER=1
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
#export USE_LIBUV=1

dcgmi profile --pause
# adjust sbatch --ntasks and sbatch --nodes above and --nnodes below
# to your specific node count, and update target launch file.
srun torchrun --nnodes 4 --nproc_per_node 8 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" ./finetuning.py --enable_fsdp --use_peft --peft_method lora
dcgmi profile --resume
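
For the 2-node, 4-GPUs-per-node setup from the original report, only the resource directives and the matching torchrun flags would need to change; a sketch (illustrative values, not a tested configuration):

#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --gpus-per-task=4

# keep --nnodes equal to --nodes and --nproc_per_node equal to --gpus-per-task
srun torchrun --nnodes 2 --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d \
    --rdzv_endpoint "$head_node_ip:29500" ./finetuning.py --enable_fsdp --use_peft --peft_method lora
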
guyuchao commented 3 weeks ago

That perfectly solved my issue. Thanks.