openucx / xccl


Segfault creating process group with 1 member #56

Closed. froody closed this issue 4 years ago.

froody commented 4 years ago

The xccl backend crashes when using torch_ucc and creating a process group with torch.distributed.new_group([0]). This may be an issue with torch_ucc, but xccl appears in the backtrace so I'm filing the issue here.

Error log: https://gist.github.com/froody/d35d7571b1a8df0638867066d96ecc6c

Relevant error message: [devfair0133:73576:0:73576] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffff00000001)

Steps to reproduce:

  1. Download https://gist.github.com/froody/6286597d33849ff8a108831c31ccd66b to hello_ucx.py
  2. Run TORCH_UCC_COLL_BACKEND=xccl python hello_ucx.py
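
For reference, a minimal reproducer along the lines of the linked hello_ucx.py might look like the sketch below. This is an assumption about what the gist does, not a copy of it; the "ucc" backend name, the init method, and the environment-variable handling are guesses based on how torch_ucc is typically used.

```python
# Hypothetical minimal reproducer (assumed to approximate the linked hello_ucx.py,
# not copied from it). Run with TORCH_UCC_COLL_BACKEND=xccl.
import os
import torch.distributed as dist
import torch_ucc  # registers the "ucc" process-group backend

def main():
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group(
        backend="ucc",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
    )
    # Creating a process group with a single member is what triggers the
    # segfault reported above when the xccl collective backend is selected.
    group = dist.new_group([0])
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```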

Versions:

  * PyTorch: 1.7.0a0+0759809
  * UCX: 1.9.0
  * XCCL: 2e97986fa14ee2538c6ffc577bb75d7434755935
  * Torch-UCC: ed0c8dfccf11f73ca60265ce5b6e76220c07f343

Sergei-Lebedev commented 4 years ago

Hi @froody, thanks for bringing this up. I've added a fix to torch-ucc; the issue was with the OOB allgather when the group size is 1. However, support for multiple groups in torch-ucc is not optimized or well tested: each new group currently creates a new set of endpoints and xccl contexts instead of just creating a new xccl team. Is this an important use case for your workloads?
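
For context, the failure mode described here is the degenerate single-member case of an out-of-band allgather. Below is a purely generic illustration of the kind of guard such a helper needs; it is not the actual torch-ucc code or fix, and `oob_allgather` / `exchange` are made-up names.

```python
# Generic illustration (not the actual torch-ucc code): an out-of-band allgather
# helper that special-cases a single-member group. Exchange logic written with
# two or more participants in mind can otherwise touch invalid peer state.
def oob_allgather(local_value, ranks, exchange):
    """Gather one value from every rank in `ranks`.

    `exchange(local_value, ranks)` stands in for the real peer-to-peer
    exchange and is only meaningful for groups with at least two members.
    """
    if len(ranks) == 1:
        # Degenerate case: nothing to exchange, return our own contribution.
        return [local_value]
    return exchange(local_value, ranks)
```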

froody commented 4 years ago

Creating groups of size 1 isn't important for my workloads, but creating lots of groups larger than 1 is. An example use case is Megatron, where one group is created per shard for both DDP and tensor parallelism: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/initialize.py#L68-L87
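
To illustrate, the Megatron-style pattern comes down to calling torch.distributed.new_group once per shard. The sketch below is an assumption based on the linked initialize.py, not a copy of it; the function name and rank layout are invented for illustration.

```python
# Hypothetical sketch of Megatron-style group creation (not copied from
# Megatron-LM): one subgroup per tensor-parallel shard and one per
# data-parallel shard, each created with torch.distributed.new_group.
import torch.distributed as dist

def create_parallel_groups(world_size, tensor_parallel_size):
    data_parallel_size = world_size // tensor_parallel_size
    tp_groups, dp_groups = [], []

    # Tensor-parallel groups: consecutive ranks form one shard.
    for i in range(data_parallel_size):
        ranks = list(range(i * tensor_parallel_size, (i + 1) * tensor_parallel_size))
        tp_groups.append(dist.new_group(ranks))

    # Data-parallel groups: ranks holding the same position within a shard.
    for j in range(tensor_parallel_size):
        ranks = list(range(j, world_size, tensor_parallel_size))
        dp_groups.append(dist.new_group(ranks))

    # With the torch-ucc behavior described above, every new_group call here
    # would create a fresh set of endpoints and xccl contexts.
    return tp_groups, dp_groups
```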

Sergei-Lebedev commented 4 years ago

I see, thanks. It's definitely worth adding proper support for multiple groups in torch-ucc, but it may take a while.