Closed: froody closed this issue 4 years ago
Hi @froody, thanks for bringing this up. I've added a fix to torch-ucc; the issue was with the oob allgather when the group size is 1. However, multiple-group support in torch-ucc is not optimized or tested: each time a new group is created, a new set of endpoints and XCCL contexts is created instead of just creating a new XCCL team. Is this an important use case for your workloads?
Creating groups of size 1 isn't important for my workloads, but creating lots of groups larger than 1 is. An example use case is Megatron, where one group is created per shard for both DDP and tensor parallelism (roughly the pattern sketched below): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/initialize.py#L68-L87
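For reference, a simplified, illustrative sketch of the grouping pattern in the linked Megatron code (not the exact implementation; `tensor_model_parallel_size` and the backend are placeholders, and the grouping logic is independent of which backend is used):

```python
import os
import torch.distributed as dist

# Placeholder backend so the sketch runs anywhere; in this issue's setting it
# would be the torch-ucc/XCCL backend. Assumes RANK, WORLD_SIZE, MASTER_ADDR,
# and MASTER_PORT are set in the environment.
dist.init_process_group(
    backend="gloo",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

world_size = dist.get_world_size()
tensor_model_parallel_size = 2  # placeholder value

# One group per tensor-model-parallel shard: consecutive ranks share a group.
for i in range(world_size // tensor_model_parallel_size):
    ranks = list(range(i * tensor_model_parallel_size,
                       (i + 1) * tensor_model_parallel_size))
    tp_group = dist.new_group(ranks)  # every rank must enter this call

# One data-parallel (DDP) group per position within a shard.
for i in range(tensor_model_parallel_size):
    ranks = list(range(i, world_size, tensor_model_parallel_size))
    dp_group = dist.new_group(ranks)
```

Even on a single 8-GPU node with `tensor_model_parallel_size=2`, this creates 4 tensor-parallel groups plus 2 data-parallel groups, so per-group setup cost (new endpoints and contexts per group) adds up quickly.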
I see, thanks. It's definitely worth adding proper support for multiple groups in torch-ucc, but it may take a while.
The xccl backend crashes when using torch_ucc and creating a process group with torch.distributed.new_group([0]). This may be an issue with torch_ucc, but xccl appears in the backtrace, so I'm filing the issue here.

Error log: https://gist.github.com/froody/d35d7571b1a8df0638867066d96ecc6c
Relevant error message:
[devfair0133:73576:0:73576] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffff00000001)
Steps to reproduce:
TORCH_UCC_COLL_BACKEND=xccl python hello_ucx.py
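hello_ucx.py itself isn't attached to the issue; a minimal script along the following lines reproduces the call pattern described above (a sketch only, assuming torch_ucc registers a "ucc" backend on import and that the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables are set):

```python
# Hypothetical minimal reproducer -- the real hello_ucx.py is not shown in
# this issue. Assumes torch_ucc registers the "ucc" backend on import and that
# MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set in the environment.
import os
import torch.distributed as dist
import torch_ucc  # noqa: F401  (registers the UCC/XCCL process-group backend)

dist.init_process_group(
    backend="ucc",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

# Creating a size-1 group is what triggers the segfault in the XCCL backend;
# every rank has to enter this call even if it is not a member of the group.
group = dist.new_group([0])
print(f"rank {dist.get_rank()}: created group of size 1")
dist.barrier()
```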
PyTorch version: 1.7.0a0+0759809
UCX version: 1.9.0
XCCL @ 2e97986fa14ee2538c6ffc577bb75d7434755935
Torch-UCC @ ed0c8dfccf11f73ca60265ce5b6e76220c07f343