pytorch / pytorch

[Distributed] negative color value passed to comm split #137856

Open kwen2501 opened 7 hours ago

kwen2501 commented 7 hours ago

🐛 Describe the bug

From documentation:

ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclConfig_t* config)

Ranks which pass the same color value will be part of the same group; color must be a non-negative value. If it is passed as NCCL_SPLIT_NOCOLOR, it means that the rank will not be part of any group.

However, the current code can pass a negative color to the NCCL API.
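As a minimal sketch of one way to keep a derived color inside the range the documentation allows: the snippet below assumes the color is computed from a hash of the group's ranks and stored in a 32-bit C int, neither of which this issue confirms. `color_from_ranks` and `INT32_MAX` are hypothetical names for illustration, not PyTorch or NCCL APIs.

```python
import hashlib

# Assumption: the C `int` that receives the color is 32 bits wide.
INT32_MAX = 2**31 - 1


def color_from_ranks(ranks):
    """Hypothetical helper: map a list of ranks to a non-negative color."""
    # Hash a canonical string form of the ranks so that every rank in the
    # group computes the same value.
    digest = hashlib.sha1("_".join(map(str, ranks)).encode()).digest()
    # Take 8 bytes of the digest, then mask to 31 bits. The result lies in
    # [0, INT32_MAX], so it cannot turn negative when stored in a 32-bit int.
    return int.from_bytes(digest[:8], "big") & INT32_MAX


print(color_from_ranks([0, 1]), color_from_ranks([2, 3]))
```

Masking to 31 bits trades some hash width for a guarantee that the value never collides with the sign bit or with NCCL_SPLIT_NOCOLOR.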

Repro:

import torch
import os
import torch.distributed as dist

def repro(rank, world_size):
    device=torch.device("cuda", rank)
    dist.init_process_group(
        "nccl",
        rank=rank,
        world_size=world_size,
        device_id=device,
    )
    device_mesh = dist.init_device_mesh(
        "cuda", (2, world_size // 2)
    )
    dist.destroy_process_group()
    print("clean exit")

if __name__ == "__main__":
    repro(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))

Run with:

TORCH_CPP_LOG_LEVEL=INFO TORCH_NCCL_USE_COMM_NONBLOCKING=1 torchrun --nproc-per-node 4 repro.py

The run fails with:

[rank2]:   File "/data/users/kw2501/nb_mesh/repro.py", line 13, in repro
[rank2]:     device_mesh = dist.init_device_mesh(
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 958, in init_device_mesh
[rank2]:     device_mesh = DeviceMesh(
[rank2]:                   ^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 453, in __init__
[rank2]:     self._init_process_groups()
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 556, in _init_process_groups
[rank2]:     dim_group = new_group(
[rank2]:                 ^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/c10d_logger.py", line 97, in wrapper
[rank2]:     func_return = func(*args, **kwargs)
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 4675, in new_group
[rank2]:     return _new_group_with_tag(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 4758, in _new_group_with_tag
[rank2]:     pg, pg_store = _new_process_group_helper(
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 1960, in _new_process_group_helper
[rank2]:     eager_backend.eager_connect_single_device(device_id)
[rank2]: RuntimeError: Color must be a non-negative value or NCCL_SPLIT_NOCOLOR (-1), but got -2057847794
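The condition implied by that error message can be sketched in Python as follows. This is an illustration only: `check_split_color` is a hypothetical helper, not the actual check in PyTorch, and the value -1 for NCCL_SPLIT_NOCOLOR is taken from the message itself.

```python
NCCL_SPLIT_NOCOLOR = -1  # sentinel value, as quoted in the error message


def check_split_color(color):
    """Hypothetical helper mirroring the check implied by the error message."""
    # Any negative value other than the NOCOLOR sentinel is rejected.
    if color < 0 and color != NCCL_SPLIT_NOCOLOR:
        raise ValueError(
            "Color must be a non-negative value or NCCL_SPLIT_NOCOLOR (-1), "
            f"but got {color}"
        )
    return color


check_split_color(0)                   # accepted: non-negative
check_split_color(NCCL_SPLIT_NOCOLOR)  # accepted: rank opts out of every group
try:
    check_split_color(-2057847794)     # the value reported in the traceback
except ValueError as err:
    print(err)
```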

Versions

main as of 10/13/2024

kwen2501 commented 6 hours ago

CC: @wz337 @shuqiangzhang