ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core][compiled graphs] Allow users to specify control dependencies across actors #48237

Open kevin85421 opened 4 days ago

kevin85421 commented 4 days ago

Description

[image: execution DAG across actors A, B, C, and D]

A, B, C, and D are different actors, and all data between these four actors is transferred through NCCL channels.

There is no data dependency between B and C. However, B and C share the same GPU, and when an actor is using the GPU, it must utilize the entire GPU. Therefore, B and C cannot execute simultaneously.

This request is from users who are trying to test ChatLearn with RayCG:

https://github.com/kevin85421/RayCG-ChatLearn/issues/3
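To make the request concrete, here is a minimal plain-Python sketch (not the Ray API) of the DAG described above. It models the data edges A→B, A→C, B→D, C→D, and shows that a wave-based topological scheduler is free to run B and C in the same wave (i.e., concurrently) unless an explicit control edge B→C is added, which is exactly the ordering constraint this issue asks for:

```python
from collections import defaultdict

def parallel_schedule(nodes, edges):
    """Group nodes into waves; nodes in the same wave have no
    dependency between them and may run concurrently."""
    preds = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
    done, waves = set(), []
    while len(done) < len(nodes):
        # a node is ready once all of its predecessors have finished
        wave = [n for n in nodes if n not in done and preds[n] <= done]
        done.update(wave)
        waves.append(wave)
    return waves

data_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(parallel_schedule("ABCD", data_edges))
# adding a control edge B -> C forces B and C into separate waves
print(parallel_schedule("ABCD", data_edges + [("B", "C")]))
```

With only the data edges, B and C land in the same wave; with the control edge they are serialized, which is the behavior needed when the two actors cannot share the GPU's compute at the same time.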

Use case

No response

rkooo567 commented 4 days ago

Interesting. What's the motivation for this?

I think we can technically achieve it by using 0.5 GPU for B and C

> when an actor is using the GPU, it must utilize the entire GPU.

This part is a bit tricky. Is it required to "stream" requests? I.e., while C is running, do we want to process a new batch of requests on A? If we want streaming semantics, this is pretty hard to express (if there's no streaming requirement, we can just send "mock" data to one of B and C and make it a no-op).
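The "mock data" workaround can be illustrated with plain Python threads (no Ray involved): B hands a dummy token to C, so C's wait on that token acts as an artificial data dependency that serializes the two even though C never uses the value.

```python
import threading
import queue

order = []
token = queue.Queue()

def actor_b():
    order.append("B")   # B does its work first
    token.put(None)     # then emits a dummy "mock" output for C

def actor_c():
    token.get()         # C blocks on the mock input: a data edge that
    order.append("C")   # exists only to enforce B-before-C ordering

tb = threading.Thread(target=actor_b)
tc = threading.Thread(target=actor_c)
tc.start()
tb.start()
tb.join()
tc.join()
print(order)  # always ['B', 'C'], regardless of thread start order
```

The cost of this trick in the real system is an extra (useless) channel transfer per step, which is why a first-class control dependency would be cleaner.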

kevin85421 commented 4 days ago

> Interesting. What's the motivation for this?

I am not 100% sure. @BeingGod feel free to correct me if I am wrong.

Instead of using multiple GPU nodes, the default setting for ChatLearn is to run everything on a single GPU node. Therefore, it requires onloading and offloading data before and after each step in online DPO.

BeingGod commented 4 days ago

> Interesting. What's the motivation for this?

> I am not 100% sure. @BeingGod feel free to correct me if I am wrong.
>
> Instead of using multiple GPU nodes, the default setting for ChatLearn is to run everything on a single GPU node. Therefore, it requires onloading and offloading data before and after each step in online DPO.

@kevin85421 You are almost right. ChatLearn implements a function named ems that shares GPU memory resources among different actors but does not share compute resources. This means that, at any given time, only one actor can execute on a GPU card.

More details: https://github.com/alibaba/ChatLearn/blob/main/docs/en/tutorial/ems.md

rkooo567 commented 4 days ago

ah interesting. is it because the workloads you guys run require less memory but are very compute-bound?

BeingGod commented 4 days ago

> ah interesting. is it because the workloads you guys run require less memory but are very compute-bound?

In the Online DPO workflow we have three models (called policy, reward, and reference), and their computations are sequential. Most of the computation comes from the policy model (it occupies 70-80% of the E2E time), but the memory footprints of the models are comparable. Obviously, giving the reward and reference models an entire GPU each is not economical due to the low GPU utilization.
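Some back-of-the-envelope arithmetic for the utilization argument, assuming the 70-80% figure above (midpoint 75%) and an even split of the remaining step time between the reward and reference models:

```python
policy_frac = 0.75               # assumed midpoint of "70-80% of E2E time"
other_frac = 1.0 - policy_frac   # step time left for reward + reference
per_model_frac = other_frac / 2  # assume the two models split it evenly

# A GPU dedicated to reward (or reference) only computes during that
# model's slice of each step, so its utilization is just that fraction:
print(f"dedicated-GPU utilization: {per_model_frac:.1%}")
```

Under these assumptions a dedicated reward or reference GPU would sit idle roughly 87% of the time, which is why packing all three models onto one GPU with memory sharing (and serialized compute) is attractive.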