GonyRosenman closed this issue 3 weeks ago
After further inspection, I discovered that the issue is not directly related to NCCL or DeepSpeed communication. Instead, it comes from the way I modified trainer.args.label_names. Specifically, I added 'metadata' to label_names so that a metadata tensor could be passed through the collate_fn. This change turns labels into a tuple of tensors (i.e., (labels, metadata)), which appears to hang indefinitely when processed by self.gather_function.
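For context, the change that triggered this looks roughly as follows (a minimal sketch with hypothetical variable names, not the exact code from my setup):

# Sketch: route an extra 'metadata' tensor through the eval loop via label_names
trainer.args.label_names = ['labels', 'metadata']
# The collate_fn returns both keys, so during evaluation the Trainer collects every
# entry in label_names and 'labels' effectively becomes the tuple (labels, metadata)
# by the time it reaches self.gather_function.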
I confirmed this behavior with the following minimal reproducible example:
import torch
# 'labels' comes from the evaluation loop; 'dumm' stands in for the metadata tensor
dumm = torch.randint(0, 100, (1, 36)).to('cuda:2')
self.gather_function((labels, dumm))  # hangs indefinitely when given a tuple of tensors
This snippet causes the same infinite hang as in the original issue. It seems the gather_function from the Accelerator class is not designed to handle tuples of tensors, leading to the deadlock.
To avoid this, I’ll need to explore alternatives for passing metadata through the evaluation loop without using label_names.
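One direction I'm considering (a rough sketch under my own assumptions, not a verified fix) is to keep 'metadata' out of label_names entirely and collect it per process by overriding prediction_step in the custom trainer:

from transformers import Trainer

class MetadataTrainer(Trainer):
    # Sketch: collect metadata on the side so the gathered labels stay a single tensor.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._eval_metadata = []  # per-process buffer; cross-rank gathering not handled here

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Pop the metadata before the default prediction/gather logic ever sees it
        metadata = inputs.pop('metadata', None)
        if metadata is not None:
            self._eval_metadata.append(metadata.detach().cpu())
        return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)

With this, compute_metrics can read the buffered metadata from the trainer after evaluation; since each rank only keeps its own batches, the metadata would still need its own explicit gather if it has to be aligned with the globally gathered predictions.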
Great, thanks for the update!
You saved me! Thank you!
I am encountering an infinite hang during the initial evaluation loop while training a custom LLaVA model using HuggingFace’s Trainer class. This happens only when I pass my own compute_metrics function to the custom trainer class, as outlined below. Notably, the same configuration works, though extremely slowly, if I use the alternative super call (commented in the code).
Steps to Reproduce (High-Level Code)
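(The original snippet did not survive the copy here; the wiring below is only a rough, hypothetical sketch of the kind of setup described, with placeholder names for the model, datasets, and DeepSpeed-enabled TrainingArguments built elsewhere in train.py, not the actual code.)

import numpy as np
from transformers import Trainer

def compute_metrics(eval_pred):
    # Hypothetical stand-in for the custom metric function
    predictions, labels = eval_pred
    return {'accuracy': float((np.argmax(predictions, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,                      # placeholder: built elsewhere in train.py
    args=training_args,               # placeholder: includes the DeepSpeed ZeRO-3 config
    train_dataset=train_dataset,      # placeholder
    eval_dataset=eval_dataset,        # placeholder
    compute_metrics=compute_metrics,  # the hang only appears when this is set
)
trainer.train()                       # the initial evaluation loop hangs here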
Logs (Last Few NCCL Calls)
ds_report
nvidia-smi
Command to reproduce:
python -m deepspeed.launcher.launch --world_info='{"127.0.0.1":[0,1,2]}' --master_addr=127.0.0.1 --master_port=4242 --no_local_rank /path/to/llava/train.py --lora_enable True --lora_r 128 --lora_alpha 256 --deepspeed ./scripts/zero3.json --num_train_epochs 1 --gradient_checkpointing True --model_name_or_path liuhaotian/llava-v1.5-13b --output_dir ./checkpoints/llava-v1.5-13b-task-lora --do_eval True

Environment Configuration
NCCL Debug Output: The hang occurs during ncclGroupStart() and ncclAllGather operations.
Kernel Compatibility: Based on similar issues in the DeepSpeed and NCCL communities, this might be related to GPU communication (e.g., problems with peer-to-peer or collective operations).

Troubleshooting Attempts
Tried NCCL workarounds:
export NCCL_P2P_DISABLE=1
export NCCL_LL_THRESHOLD=0
These did not resolve the issue.