ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray GPU collectives] NCCL internal error on aws.G5 node #39471

Open cadedaniel opened 1 year ago

cadedaniel commented 1 year ago

Ray NCCL collectives fail allreduce on multi-GPU AWS G5 nodes because of an issue with how the node exposes topology information. The workaround is to set NCCL_P2P_DISABLE=1, but this negatively impacts performance.
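
For reference, one way to apply the workaround in a Ray job is to set the variable through Ray's runtime environment so every worker process inherits it (a minimal sketch, assuming the runtime_env route; this is not part of the original repro):

```python
import ray

# Assumption: propagate the workaround to all worker processes for this job.
# NCCL_P2P_DISABLE=1 forces NCCL off the broken P2P path at the cost of
# allreduce bandwidth.
ray.init(runtime_env={"env_vars": {"NCCL_P2P_DISABLE": "1"}})

# The same dict can also be passed per actor, e.g.
# @ray.remote(num_gpus=4, runtime_env={"env_vars": {"NCCL_P2P_DISABLE": "1"}})
```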

Interestingly, the NCCL tests pass when run outside of Ray, so there is some gap between how NCCL behaves inside Ray collectives and outside of Ray. We should fix that.

More context https://ray-distributed.slack.com/archives/CSX7HVB5L/p1694124121110979

  File "/home/ml/ray-play/src/inference/allreduce_test.py", line 39, in <module>
    results = ray.get([w.compute.remote() for w in workers])
  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2493, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::Worker.compute() (pid=35381, ip=10.223.22.25, actor_id=b84905374d259cd0728226ab01000000, repr=<allreduce_test.Worker object at 0x7f496f95ab90>)
  File "/home/ml/ray-play/src/inference/allreduce_test.py", line 21, in compute
    collective.allreduce(self.send, "default")
  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/util/collective/collective.py", line 273, in allreduce
    g.allreduce([tensor], opts)
  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 197, in allreduce
    self._collective(tensors, tensors, collective_fn)
  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 604, in _collective
    comms = self._get_nccl_collective_communicator(key, devices)
  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 451, in _get_nccl_collective_communicator
    nccl_util.groupEnd()
  File "cupy_backends/cuda/libs/nccl.pyx", line 210, in cupy_backends.cuda.libs.nccl.groupEnd
  File "cupy_backends/cuda/libs/nccl.pyx", line 243, in cupy_backends.cuda.libs.nccl.groupEnd
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_INTERNAL_ERROR: internal error
wuxibin89 commented 1 year ago

@cadedaniel Same problem on 2 nodes with 8 GPUs on each node. Here are my NCCL environment variables:

NCCL_DEBUG: "INFO"
NCCL_SOCKET_IFNAME: "eth0"
NCCL_IB_HCA: "^=mlx5_0"
NCCL_IB_GID_INDEX: "3"
NCCL_IB_DISABLE: "0"
NCCL_IB_TIMEOUT: "25"
NCCL_IB_RETRY_CNT: "7"
1. If the 2 actors are on the same node, `collective.allreduce` succeeds.
```python
@ray.remote(num_gpus=4)
class Worker:
    ...

# imperative
num_workers = 2  # each worker needs 4 GPUs, so the 2 workers are on the same node
workers = []
init_rets = []
for i in range(num_workers):
    w = Worker.remote()
    workers.append(w)
    init_rets.append(w.setup.remote(num_workers, i))
_ = ray.get(init_rets)
results = ray.get([w.compute.remote() for w in workers])
print(results)
```

2. If the 2 actors are on different nodes, `collective.allreduce` fails.
```python
@ray.remote(num_gpus=8)
class Worker:
    ...

# imperative
num_workers = 2  # each worker needs 8 GPUs, so the 2 workers are on different nodes
workers = []
init_rets = []
for i in range(num_workers):
    w = Worker.remote()
    workers.append(w)
    init_rets.append(w.setup.remote(num_workers, i))
_ = ray.get(init_rets)
results = ray.get([w.compute.remote() for w in workers])
print(results)
```

Traceback (most recent call last):
  File "/opt/tiger/ray/session_2023-08-18_11-56-40_988471_1/runtime_resources/working_dir_files/_ray_pkg_c57d02c4f8b6cd93/ray_job.py", line 40, in <module>
    results = ray.get([w.compute.remote() for w in workers])
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2413, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::Worker.compute() (pid=10926, ip=[fdbd:dc03:9:389::40], repr=<ray_job.Worker object at 0x7faffc0fd5e0>)
  File "/opt/tiger/ray/session_2023-08-18_11-56-40_988471_1/runtime_resources/working_dir_files/_ray_pkg_c57d02c4f8b6cd93/ray_job.py", line 23, in compute
    collective.allreduce(buffer, "default")
  File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective.py", line 273, in allreduce
    g.allreduce([tensor], opts)
  File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 197, in allreduce
    self._collective(tensors, tensors, collective_fn)
  File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 604, in _collective
    comms = self._get_nccl_collective_communicator(key, devices)
  File "/usr/local/lib/python3.9/dist-packages/ray/util/collective/collective_group/nccl_collective_group.py", line 451, in _get_nccl_collective_communicator
    nccl_util.groupEnd()
  File "cupy_backends/cuda/libs/nccl.pyx", line 210, in cupy_backends.cuda.libs.nccl.groupEnd
  File "cupy_backends/cuda/libs/nccl.pyx", line 243, in cupy_backends.cuda.libs.nccl.groupEnd
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_INTERNAL_ERROR: internal error
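
For context, a minimal sketch of what the elided `Worker` above typically looks like with `ray.util.collective`; the cupy buffer and method bodies here are assumptions for illustration, not the reporter's actual code:

```python
import cupy as cp
import ray
import ray.util.collective as collective


@ray.remote(num_gpus=8)
class Worker:
    def __init__(self):
        # One buffer per worker; allreduce sums it across the group in place.
        self.buffer = cp.ones((10,), dtype=cp.float32)

    def setup(self, world_size, rank):
        # Join the NCCL collective group named "default".
        collective.init_collective_group(
            world_size, rank, backend="nccl", group_name="default"
        )
        return True

    def compute(self):
        # This is the call that raises NCCL_ERROR_INTERNAL_ERROR in the
        # traceback above when the actors land on different nodes.
        collective.allreduce(self.buffer, "default")
        return self.buffer
```
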
cadedaniel commented 1 year ago

@wuxibin89 this issue is specifically about AWS g5 instance types. Feel free to open a new issue to discuss your problem (make sure to include NCCL debug logs!)

chadj2 commented 3 months ago

I am seeing a similar issue on a g5.12xlarge while trying to run a Hugging Face model.

When I run the p2pBandwidthLatencyTest I see the following. The huge bump in P2P-enabled latency is not normal.

P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3 
     0   1.67  12.53  12.52  12.43 
     1  12.36   1.68  12.60  21.01 
     2  12.35  12.54   1.82  14.79 
     3  12.60  12.59  12.43   1.80 

   CPU     0      1      2      3 
     0   3.13   9.14   8.92   8.87 
     1   9.08   3.12   8.93   8.84 
     2   9.09   8.89   3.01   8.98 
     3   8.87   8.90   8.92   3.01 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3 
     0   1.67 49207.21 49207.22 49207.24 
     1 49207.26   1.68 49207.18 49207.20 
     2 49207.27 49207.26   1.82 49207.28 
     3 49207.28 49207.27 49207.22   1.79 

   CPU     0      1      2      3 
     0   3.11   2.62   2.60   2.72 
     1   2.85   3.23   2.63   2.71 
     2   2.75   2.65   3.21   2.61 
     3   2.62   2.62   2.65   3.51 

Here is the connectivity matrix.

# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-47    0               N/A
GPU1    PHB      X      PHB     PHB     0-47    0               N/A
GPU2    PHB     PHB      X      PHB     0-47    0               N/A
GPU3    PHB     PHB     PHB      X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Using the Hugging Face `accelerate test` utility, I have found some additional P2P modes that work; a sketch of carrying these settings over to Ray follows the list below.

# working
NCCL_P2P_DISABLE=1 accelerate test
NCCL_P2P_LEVEL=PXB accelerate test
NCCL_P2P_LEVEL=PIX accelerate test

# not working
NCCL_P2P_LEVEL=PHB accelerate test
NCCL_P2P_LEVEL=SYS accelerate test
NCCL_P2P_DISABLE=0 accelerate test
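
Assuming these `accelerate` findings carry over to the Ray repro earlier in this thread, the same variable could be set for the whole Ray job; a sketch under that assumption only, not something verified on g5 here:

```python
import ray

# PXB/PIX only allow NCCL P2P over paths that stay below the PCIe host bridge.
# Every GPU pair on this node is PHB-connected, so these levels effectively
# disable GPU P2P, consistent with the NCCL_P2P_DISABLE=1 workaround above.
ray.init(runtime_env={"env_vars": {"NCCL_P2P_LEVEL": "PXB", "NCCL_DEBUG": "INFO"}})
```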

Any other insights would be helpful.

cadedaniel commented 3 months ago

@chadj2 can you share how you run this with Ray?

chadj2 commented 3 months ago

I apologize for not being clearer. I am seeing problems with AWS G5 instances that impact Ray in addition to Hugging Face models. The diagnostic information I provided gives evidence of the underlying problems. There are surprisingly few message boards discussing the shortcomings of these G5 instances, so any new information on this thread will probably help me.