meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supporting a number of inference solutions such as HF TGI, VLLM for local or cloud deployment. Demo apps to showcase Meta Llama for WhatsApp & Messenger.

Multi-Node Training Timeout Error #688

Open jonghyunL opened 1 month ago

jonghyunL commented 1 month ago

System Info

Env: PyTorch 2.5 nightly, CUDA 12.4, Python 3.10, NVIDIA Hopper, 2 GPUs, NCCL 2.21.5(?)

Information

🐛 Describe the bug

Hi, I am trying to run multi-node fine-tuning of Llama, where each GPU resides in a separate VM (2 VMs on a single machine, one GPU per VM) connected by a bridge network. Since this is for hardware research, I am only running a single epoch of 200 steps for testing.

I do not have a great understanding of how distributed data parallelism works in a multi-node setting, but I run into this error message on both of my VMs.

I tried raising the timeout limit with torch.distributed.init_process_group(backend="nccl", timeout=timedelta(hours=1)) so that this exit barrier doesn't get triggered by the timeout. I also tried changing the point at which the barrier times out, but that didn't work either.
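
The change looks roughly like this (a sketch; the actual call lives inside the llama-recipes setup code, so the surrounding lines may differ):

from datetime import timedelta

import torch.distributed as dist

# Raise the collective/rendezvous timeout so a slow step does not abort the job.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))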

Can anyone help me understand what this message implies and how I can solve it?

Error logs

[screenshot: timeout error log]

Expected behavior

I expected the system to perform the all_reduce, but it just terminates due to a timeout.

HamidShojanazeri commented 1 month ago

@jonghyunL there are several things that could cause this, but have you tried setting the variables below on your two VMs?

# On the second VM (rank 1):
export MASTER_ADDR="192.168.1.1"    # Same as the primary node
export MASTER_PORT=12355            # Same port
export WORLD_SIZE=2                 # Same total number of processes
export RANK=1

# On the primary VM (rank 0):
export MASTER_ADDR="192.168.1.1"    # Same as the primary node
export MASTER_PORT=12355            # Same port
export WORLD_SIZE=2                 # Same total number of processes
export RANK=0
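
With those exported, a quick way to smoke-test the rendezvous on each VM, independently of the llama-recipes scripts, is something like the following sketch (run the same file on both VMs):

import os
from datetime import timedelta

import torch
import torch.distributed as dist

# init_method="env://" picks up MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK.
dist.init_process_group(backend="nccl", init_method="env://", timeout=timedelta(minutes=2))
torch.cuda.set_device(0)  # one GPU per VM
dist.barrier()            # hangs or times out here if the two VMs cannot reach each other
print(f"rank {os.environ['RANK']} of {os.environ['WORLD_SIZE']} joined the group")
dist.destroy_process_group()
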
jonghyunL commented 1 month ago

Thank you for your reply, Hamid. I have tried exporting the environment variables before running, but it still gives me this timeout error.

I tried setting up the environment with export NCCL_SOCKET_IFNAME=eno1, but now it gives me

[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1723102898088/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:318, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Error: network not found.

With NCCL_DEBUG=WARN it shows this error message: transport/net_socket.cc:46 NCCL WARN NET/Socket : no interface found

Any clue? Are there any more settings that I need to provide?
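
For completeness, the NCCL debugging knobs I am aware of can also be set from Python before the process group is created (a sketch; exporting the same variables before torchrun should be equivalent):

import os

# NCCL reads these when the communicator is created, so they must be set
# before torch.distributed.init_process_group.
os.environ["NCCL_DEBUG"] = "INFO"             # VERSION / WARN / INFO / TRACE
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus the log on init and network selection
os.environ["NCCL_SOCKET_IFNAME"] = "eno1"     # must match an interface name inside the VM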

mreso commented 1 month ago

Hi @jonghyunL, NCCL_SOCKET_IFNAME will need to be specific to your environment. Is the bridge interface eno1 inside the VM? Also, can you check that the master IP is correct? Is it the IP of the bridge? Can you provide the output of ifconfig on the VMs and check whether ping between the machines works?
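
One quick way to see which interface names exist inside each VM, without leaving Python, is a sketch like this (standard library only):

import socket

# Interface names visible inside this VM; NCCL_SOCKET_IFNAME must match one of
# these (in a bridged VM this is typically something like enp0s1 or eth0, not
# the host-side bridge name).
for index, name in socket.if_nameindex():
    print(index, name)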

jonghyunL commented 1 month ago

So, based on ifconfig, I set my env variable to NCCL_SOCKET_IFNAME=enp0s1. Both ping and iperf3 work well between the two VMs.

[screenshots]
jonghyunL commented 1 month ago

Previously, I was running with the configuration "torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.20.189.64 --master_port=1234 finetuning.py --dist_checkpoint_root_folder model_checkpoint --dist_checkpoint_folder fine-tuned --model_name meta-llama/Llama-2-7b-hf --output_dir output_llama_7b/ --use_peft --peft_method lora --use_fp16 --max_train_step=10 --batch_size_training=1 --num_epoch=2 --enable_fsdp > llama_7b_native_multi2.out"

Before finishing the first iteration, this error comes out.

[screenshot: error output]

There was a post about the same issue (https://github.com/NVIDIA/nccl/issues/626), so I added "export NCCL_PROTO=Simple".

Now another timeout error comes out at the first iteration.

[screenshot: timeout error at the first iteration]
jonghyunL commented 1 month ago

This is also a screenshot of a test showing torch.distributed.all_reduce working under the same environment.

[screenshot: all_reduce test output]

This is the code used.

[screenshot: test code]
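
For reference, a minimal two-process all_reduce test of this kind looks roughly like the following sketch (not necessarily the exact code in the screenshot; the file name allreduce_test.py is only illustrative):

import torch
import torch.distributed as dist

# Launched on each VM with the same torchrun arguments as the fine-tuning run, e.g.
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=172.20.189.64 --master_port=1234 allreduce_test.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(0)  # one GPU per VM

t = torch.tensor([float(rank + 1)], device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: all_reduce result = {t.item()}")  # expect 3.0 with two ranks

dist.destroy_process_group()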

I'm kind of lost as to what I should be doing.