@cadedaniel Same problem on 2 nodes with 8 GPUs on each node. Here are my NCCL environment variables:

```yaml
NCCL_DEBUG: "INFO"
NCCL_SOCKET_IFNAME: "eth0"
NCCL_IB_HCA: "^=mlx5_0"
NCCL_IB_GID_INDEX: "3"
NCCL_IB_DISABLE: "0"
NCCL_IB_TIMEOUT: "25"
NCCL_IB_RETRY_CNT: "7"
```
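For completeness, here is a minimal sketch (the helper task is hypothetical, not part of my original setup, and it assumes the cluster above is already running) to confirm these variables are actually visible inside the Ray worker processes:

```python
# Hypothetical helper: check which NCCL_* variables the Ray workers actually see.
import os

import ray

ray.init(address="auto")  # assumes an already-running cluster

@ray.remote(num_gpus=1)
def show_nccl_env():
    # Return only the NCCL-related variables visible in this worker process.
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}

print(ray.get(show_nccl_env.remote()))
```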
1. If 2 actors are on the same node, `collective.allreduce` succeeds.

```python
@ray.remote(num_gpus=4)
class Worker:
    ...

num_workers = 2  # each worker needs 4 GPUs, so 2 workers on the same node
workers = []
init_rets = []
for i in range(num_workers):
    w = Worker.remote()
    workers.append(w)
    init_rets.append(w.setup.remote(num_workers, i))
_ = ray.get(init_rets)
results = ray.get([w.compute.remote() for w in workers])
print(results)
```
2. If 2 actors are on different nodes, `collective.allreduce` fails.
```python
@ray.remote(num_gpus=8)
class Worker:
    ...

# imperative
num_workers = 2  # each worker needs 8 GPUs, so 2 workers on different nodes
workers = []
init_rets = []
for i in range(num_workers):
    w = Worker.remote()
    workers.append(w)
    init_rets.append(w.setup.remote(num_workers, i))
_ = ray.get(init_rets)
results = ray.get([w.compute.remote() for w in workers])
print(results)
```

```
Traceback (most recent call last):
File "/opt/tiger/ray/session_2023-08-18_11-56-40_988471_1/runtime_resources/working_dir_files/_ray_pkg_c57d02c4f8b6cd93/ray_job.py", line 40, in <module>
results = ray.get([w.compute.remote() for w in workers])
File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2413, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::Worker.compute() (pid=10926, ip=[fdbd:dc03:9:389::40], repr=<ray_job.Worker object at 0x7faffc0fd5e0>)
File "/opt/tiger/ray/session_2023-08-18_11-56-40_988471_1/runtime_resources/working_dir_files/_ray_pkg_c57d02c4f8b6cd93/ray_job.py", line 23, in compute
collective.allreduce(buffer, "default")
File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective.py", line 273, in allreduce
g.allreduce([tensor], opts)
File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 197, in allreduce
self._collective(tensors, tensors, collective_fn)
File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 604, in _collective
comms = self._get_nccl_collective_communicator(key, devices)
File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 451, in _get_nccl_collective_communicator
nccl_util.groupEnd()
File "cupy_backends/cuda/libs/nccl.pyx", line 210, in cupy_backends.cuda.libs.nccl.groupEnd
File "cupy_backends/cuda/libs/nccl.pyx", line 243, in cupy_backends.cuda.libs.nccl.groupEnd
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_INTERNAL_ERROR: internal error
```
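For readers following along, here is a hypothetical reconstruction of what the elided `setup`/`compute` methods typically look like with `ray.util.collective` (the buffer shape and return value are illustrative, not the reporter's actual code):

```python
# Hypothetical sketch of the elided Worker methods, based on the
# ray.util.collective calls visible in the traceback above.
import cupy as cp
import ray
from ray.util import collective

@ray.remote(num_gpus=8)
class Worker:
    def setup(self, world_size, rank):
        # Join the NCCL collective group named "default".
        collective.init_collective_group(
            world_size, rank, backend="nccl", group_name="default"
        )
        return True

    def compute(self):
        # All-reduce a small buffer across every worker in the "default" group.
        buffer = cp.ones((10,), dtype=cp.float32)
        collective.allreduce(buffer, "default")
        return cp.asnumpy(buffer)
```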
@wuxibin89 this issue is specifically about AWS g5 instance types. Feel free to open a new issue to discuss your problem (make sure to include NCCL debug logs!)
I am seeing a similar issue on a g5.12xlarge while trying to run a huggingface model.
When I run the p2pBandwidthLatencyTest I see the following. The huge bump in latency with P2P enabled is not normal:
```
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.67  12.53  12.52  12.43
     1  12.36   1.68  12.60  21.01
     2  12.35  12.54   1.82  14.79
     3  12.60  12.59  12.43   1.80

   CPU     0      1      2      3
     0   3.13   9.14   8.92   8.87
     1   9.08   3.12   8.93   8.84
     2   9.09   8.89   3.01   8.98
     3   8.87   8.90   8.92   3.01

P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0         1         2         3
     0      1.67  49207.21  49207.22  49207.24
     1  49207.26      1.68  49207.18  49207.20
     2  49207.27  49207.26      1.82  49207.28
     3  49207.28  49207.27  49207.22      1.79

   CPU     0      1      2      3
     0   3.11   2.62   2.60   2.72
     1   2.85   3.23   2.63   2.71
     2   2.75   2.65   3.21   2.61
     3   2.62   2.62   2.65   3.51
```
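As a rough cross-check (a hypothetical diagnostic, not part of the test output above), peer-access capability can also be queried directly from the CUDA runtime. Note that on the affected instances the driver may still report peer access as available even though actual P2P transfers stall, so treat this only as a first pass:

```python
# Rough peer-access check via the CUDA runtime (illustrative only).
import cupy as cp

n = cp.cuda.runtime.getDeviceCount()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = cp.cuda.runtime.deviceCanAccessPeer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'reported' if ok else 'not reported'}")
```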
Here is the connectivity matrix.
```
# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-47            0               N/A
GPU1    PHB      X      PHB     PHB     0-47            0               N/A
GPU2    PHB     PHB      X      PHB     0-47            0               N/A
GPU3    PHB     PHB     PHB      X      0-47            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
Using the huggingface accelerate test utility, I have found some additional P2P modes that work:

```
# working
NCCL_P2P_DISABLE=1 accelerate test
NCCL_P2P_LEVEL=PXB accelerate test
NCCL_P2P_LEVEL=PIX accelerate test

# not working
NCCL_P2P_LEVEL=PHB accelerate test
NCCL_P2P_LEVEL=SYS accelerate test
NCCL_P2P_DISABLE=0 accelerate test
```
Any other insights would be helpful.
@chadj2 Can you share how you ran this with Ray?
I apologize that this was not clearer. I am seeing problems with AWS G5 instances that impact Ray in addition to huggingface models. The basic diagnostic information I provided gives evidence of the underlying problems. There are surprisingly few message boards out there discussing the shortcomings of these G5 instances, so new information on this thread will probably help me.
Ray NCCL collectives fail allreduce on multi-GPU AWS G5 nodes because of an issue with how the node exposes topology information. The workaround is to apply `NCCL_P2P_DISABLE=1`, but this negatively impacts performance. Interestingly, the NCCL tests work without Ray, so there is some gap between how NCCL works inside Ray collectives vs. outside of Ray. We should fix that.
More context: https://ray-distributed.slack.com/archives/CSX7HVB5L/p1694124121110979
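For anyone else hitting this, one way to apply the workaround to a Ray job is to propagate the variable through `runtime_env` (a sketch; whether this is acceptable depends on how much the lost P2P bandwidth matters for your workload):

```python
# Sketch: propagate the NCCL_P2P_DISABLE=1 workaround to all Ray workers
# started for this job via runtime_env env_vars.
import ray

ray.init(
    address="auto",
    runtime_env={"env_vars": {"NCCL_P2P_DISABLE": "1"}},
)
```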