ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclConfig_t* config)
Ranks which pass the same color value will be part of the same group; color must be a non-negative value.
If it is passed as NCCL_SPLIT_NOCOLOR, it means that the rank will not be part of any group.
However, the current code can pass a negative color to the NCCL API.
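To make the constraint concrete, here is a minimal sketch of the documented rule and of one plausible way a negative color could arise. It assumes, without confirmation from the PyTorch source, that the color is derived from a wider integer and then narrowed to a 32-bit signed `int`. The helper names `is_valid_split_color`, `as_c_int`, and `nonnegative_color` are hypothetical and are not part of PyTorch or NCCL. The value `-1` for `NCCL_SPLIT_NOCOLOR` is taken from the RuntimeError in the traceback.

```python
import ctypes
import hashlib

# Value taken from the RuntimeError text in this report.
NCCL_SPLIT_NOCOLOR = -1


def is_valid_split_color(color: int) -> bool:
    """Documented rule: color is non-negative, or equals NCCL_SPLIT_NOCOLOR."""
    return color >= 0 or color == NCCL_SPLIT_NOCOLOR


def as_c_int(value: int) -> int:
    """Narrow a Python int to a 32-bit signed int, the way a C cast would."""
    return ctypes.c_int32(value).value


def nonnegative_color(ranks: list[int]) -> int:
    """Hypothetical helper: hash a rank list into the range [0, 2**31 - 1].

    Reducing modulo 2**31 keeps the result representable as a
    non-negative 32-bit signed int, so narrowing cannot flip the sign.
    """
    digest = hashlib.sha1(repr(tuple(ranks)).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (2**31)


# An unsigned 32-bit value above 2**31 - 1 turns negative when narrowed
# (assumption: this is only one way the reported value could be produced).
wrapped = as_c_int(2237119502)
color = nonnegative_color([0, 1, 2, 3])
```

Under these assumptions, `wrapped` equals the color from the error message and fails `is_valid_split_color`, while `color` stays valid after narrowing. Any fix that bounds the color to the non-negative `int` range before calling `ncclCommSplit` would satisfy the documented contract; the modulo used here is just one option.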
[rank2]: File "/data/users/kw2501/nb_mesh/repro.py", line 13, in repro
[rank2]: device_mesh = dist.init_device_mesh(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 958, in init_device_mesh
[rank2]: device_mesh = DeviceMesh(
[rank2]: ^^^^^^^^^^^
[rank2]: File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 453, in __init__
[rank2]: self._init_process_groups()
[rank2]: File "/data/users/kw2501/pytorch/torch/distributed/device_mesh.py", line 556, in _init_process_groups
[rank2]: dim_group = new_group(
[rank2]: ^^^^^^^^^^
[rank2]: File "/data/users/kw2501/pytorch/torch/distributed/c10d_logger.py", line 97, in wrapper
[rank2]: func_return = func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 4675, in new_group
[rank2]: return _new_group_with_tag(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 4758, in _new_group_with_tag
[rank2]: pg, pg_store = _new_process_group_helper(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py", line 1960, in _new_process_group_helper
[rank2]: eager_backend.eager_connect_single_device(device_id)
[rank2]: RuntimeError: Color must be a non-negative value or NCCL_SPLIT_NOCOLOR (-1), but got -2057847794
🐛 Describe the bug
From documentation:
Repro:
We can see:
Versions
main as of 10/13/2024