ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.07k stars 5.79k forks

[core][compiled graphs] Add CPU-based NCCL communicator for development #47936

Open stephanie-wang opened 1 month ago

stephanie-wang commented 1 month ago

Description

For development and debugging, it's useful to be able to run compiled graphs that contain NCCL transport hints using a CPU-based communicator. The communicator could use the Ray object store / a Ray actor to perform p2p and collective ops.
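
To make the p2p half of this idea concrete, here is a minimal sketch of what a CPU-side send/recv path could look like. It is not Ray code: `CPUMailbox` and its methods are hypothetical names, and a plain in-process object with threads stands in for the shared Ray actor or object store mentioned above, so the sketch runs without a cluster or GPU.

```python
import queue
import threading


class CPUMailbox:
    """Hypothetical in-process stand-in for a shared Ray actor / object store."""

    def __init__(self):
        self._lock = threading.Lock()
        # (src_rank, dst_rank) -> FIFO queue of payloads in flight
        self._channels = {}

    def _channel(self, src, dst):
        # Create the per-pair queue lazily; the lock keeps creation race-free.
        with self._lock:
            return self._channels.setdefault((src, dst), queue.Queue())

    def send(self, src, dst, payload):
        # Non-blocking enqueue, analogous to handing a buffer to the transport.
        self._channel(src, dst).put(payload)

    def recv(self, src, dst, timeout=5.0):
        # Blocks until the matching send arrives (raises queue.Empty on timeout).
        return self._channel(src, dst).get(timeout=timeout)


# Rank 1 waits for data from rank 0; each rank is simulated by a thread.
mailbox = CPUMailbox()
received = []
receiver = threading.Thread(
    target=lambda: received.append(mailbox.recv(src=0, dst=1))
)
receiver.start()
mailbox.send(src=0, dst=1, payload=[1.0, 2.0, 3.0])
receiver.join()
```

Keying the queues by `(src, dst)` preserves per-pair message ordering, which is the property a drop-in replacement for an NCCL send/recv pair would most likely need.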

Use case

No response

Bye-legumes commented 1 month ago

What is the difference between this CPU-based NCCL communicator and the mock NCCL in the tests? Or is this not for tests, but a fallback to CPU/shared memory if NCCL is not available?

dengwxn commented 1 month ago

Assigned to @tfsingh and @anyadontfly. I think this is mainly for tests. For starters, we should work on all-reduce first. Here's a good picture to explain collectives.
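
As a rough illustration of all-reduce on CPU, the sketch below has every rank contribute a list of numbers to a shared rendezvous object and get back the element-wise sum. The names `AllReduceRendezvous` and `run_ranks` are hypothetical, threads simulate the ranks, and the rendezvous object only stands in for a Ray actor; the actual design in Ray may differ.

```python
import threading


class AllReduceRendezvous:
    """Hypothetical in-process stand-in for a coordinating Ray actor."""

    def __init__(self, world_size):
        self.world_size = world_size
        self._cond = threading.Condition()
        self._contributions = {}  # rank -> values for the current round
        self._result = None
        self._generation = 0      # bumped once per completed round

    def allreduce_sum(self, rank, values):
        with self._cond:
            generation = self._generation
            self._contributions[rank] = list(values)
            if len(self._contributions) == self.world_size:
                # Last rank to arrive reduces element-wise, in rank order.
                columns = zip(
                    *(self._contributions[r] for r in range(self.world_size))
                )
                self._result = [sum(column) for column in columns]
                self._contributions = {}
                self._generation += 1
                self._cond.notify_all()
            else:
                # Earlier ranks block until the round completes.
                self._cond.wait_for(lambda: self._generation != generation)
            return list(self._result)


def run_ranks(per_rank_values):
    """Simulate one all-reduce with one thread per rank."""
    world_size = len(per_rank_values)
    rendezvous = AllReduceRendezvous(world_size)
    results = [None] * world_size

    def worker(rank):
        results[rank] = rendezvous.allreduce_sum(rank, per_rank_values[rank])

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_ranks([[1, 2], [10, 20], [100, 200]])
```

Every rank ends up with the same reduced list, which is the defining property of all-reduce. A central rendezvous is simpler to debug than a ring or tree algorithm, at the cost of funneling all data through one place, which seems an acceptable trade-off for a development-only communicator.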

stephanie-wang commented 1 month ago

Actually it would be good to make this work for non-testing purposes, so that users can debug DAGs with collective ops on CPU.

tfsingh commented 1 month ago

Commenting for assignment

anyadontfly commented 1 month ago

Commenting for assignment

rkooo567 commented 3 weeks ago

who's going to take this task?

stephanie-wang commented 3 weeks ago

> who's going to take this task?

@tfsingh and @anyadontfly are working on this.