jonghyunL opened this issue 1 month ago
@jonghyunL there are several things that could cause this, but have you tried setting the variables below on your two VMs?
On the secondary node (rank 1):
export MASTER_ADDR="192.168.1.1" # Same as the primary node
export MASTER_PORT=12355 # Same port
export WORLD_SIZE=2 # Same total number of processes
export RANK=1

On the primary node (rank 0):
export MASTER_ADDR="192.168.1.1" # Same as the primary node
export MASTER_PORT=12355 # Same port
export WORLD_SIZE=2 # Same total number of processes
export RANK=0
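For context, with the default env:// initialization PyTorch builds the process group from exactly these variables. A minimal sketch (not tied to your finetuning script) of how they are consumed:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Sketch: with the default env:// init method, PyTorch reads MASTER_ADDR,
# MASTER_PORT, RANK and WORLD_SIZE from the environment, so the exports
# above are what the process group is built from.
dist.init_process_group(backend="nccl", init_method="env://",
                        timeout=timedelta(minutes=30))
print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
dist.destroy_process_group()
```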
Thank you for your reply, Hamid. I have tried exporting the environment variables before running, but it still gives me this timeout error.
I tried setting export NCCL_SOCKET_IFNAME=eno1, but now it gives me:
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1723102898088/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:318, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Error: network not found.
With NCCL_DEBUG=WARN it shows this error message:
transport/net_socket.cc:46 NCCL WARN NET/Socket : no interface found
Any clue? Are there any more settings that I need to provide?
Hi @jonghyunL, NCCL_SOCKET_IFNAME will need to be specific to your environment. Is the bridge interface eno1 inside the VM? Also, can you check that the master IP is correct? Is that the IP of the bridge? Can you provide the output of ifconfig on the VMs and check whether ping between the machines works?
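If ifconfig is not available inside the VM, a quick check like the following lists the interface names visible to the process, which is what NCCL_SOCKET_IFNAME has to match (a sketch; interface names will differ per VM):

```python
import socket

# List the network interfaces visible inside this VM (Linux only).
# The chosen name is what NCCL_SOCKET_IFNAME should be set to,
# e.g. the bridge/virtio interface such as eno1 or enp0s1.
for index, name in socket.if_nameindex():
    print(index, name)
```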
Based on ifconfig, I set my env variable to NCCL_SOCKET_IFNAME=enp0s1. Both ping and iperf3 work well between the two VMs.
Previously, I was running with this configuration:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.20.189.64 --master_port=1234 finetuning.py --dist_checkpoint_root_folder model_checkpoint --dist_checkpoint_folder fine-tuned --model_name meta-llama/Llama-2-7b-hf --output_dir output_llama_7b/ --use_peft --peft_method lora --use_fp16 --max_train_step=10 --batch_size_training=1 --num_epoch=2 --enable_fsdp > llama_7b_native_multi2.out
This error came out before the first iteration finished.
There was a post regarding the same issue (https://github.com/NVIDIA/nccl/issues/626), so I added export NCCL_PROTO=Simple.
Now another timeout error comes out at the first iteration.
Here is also a screenshot showing torch.distributed.all_reduce working under the same environment.
This is the code used.
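The actual script is only in the screenshot; roughly, the test does something like this minimal sketch, assuming it is launched via torchrun with the same node/rank arguments as the finetuning command above:

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Sketch of the all_reduce smoke test: one process per VM, two nodes,
# launched with the same torchrun arguments as the finetuning run.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# Each rank contributes its own rank; after SUM the result should be 0 + 1 = 1.
t = torch.full((1,), float(dist.get_rank()), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

dist.destroy_process_group()
```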
I am kind of lost as to what I should be doing.
System Info
Env: PyTorch 2.5 nightly, CUDA 12.4, Python 3.10, NVIDIA Hopper GPUs (2 total, one per VM), NCCL 2.21.5(?)
Information
🐛 Describe the bug
Hi, I am trying to run multi-node finetuning of Llama, where each GPU resides in a separate VM (2 VMs on a single machine, one GPU per VM) connected by a bridge network. From a hardware research perspective, I only run a single epoch of 200 steps for testing.
I do not have a great understanding of how distributed data parallelism works in a multi-node setting, but I came across this error message on both of my VMs.
I tried changing the timeout limit with torch.distributed.init_process_group(backend="nccl", timeout=timedelta(hours=1)) so that this exit barrier doesn't get triggered by the timeout. I also tried changing the barrier timeout point, but that didn't work either.
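Concretely, the override is roughly this (a sketch; only the timeout differs from the default env:// initialization):

```python
from datetime import timedelta

import torch.distributed as dist

# Sketch: extend the collective timeout so the exit barrier is not the first
# thing to trip. If a rank never reaches the collective at all, this only
# delays the same failure rather than fixing it.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
```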
Can anyone help me understand what this message implies and how I can solve it?
Error logs
Expected behavior
I expected the system to perform all_reduce, but it just terminates due to a timeout.