Open kevin85421 opened 4 days ago
Interesting. What's the motivation for this?
I think we can technically achieve it by using 0.5 GPU for B and C
when an actor is using the GPU, it must utilize the entire GPU.
This part is a bit tricky. Is it required to "stream" requests? I.e., while C is running, do we want to process a new batch of requests on A? If we want streaming semantics, this is pretty hard to express. (If there's no streaming requirement, we can just send "mock" data to one of B and C and make it a no-op.)
Interesting. What's the motivation for this?
I am not 100% sure. @BeingGod feel free to correct me if I am wrong.
Instead of using multiple GPU nodes, the default setting for ChatLearn is to run everything on a single GPU node. Therefore, it requires onloading and offloading data before and after each step in online DPO.
@kevin85421 You are almost right. ChatLearn implements a feature named EMS.
It shares GPU memory among different actors but does not share compute resources, which means that at any given time only one actor can execute on a GPU card.
More details: https://github.com/alibaba/ChatLearn/blob/main/docs/en/tutorial/ems.md
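The time-sharing behavior described above can be mimicked with a per-GPU lock. This is a pure-Python analogy, not ChatLearn's actual EMS implementation:

```python
import threading

gpu_lock = threading.Lock()  # stands in for "only one actor on the GPU at a time"
timeline = []

def run_on_gpu(actor_name, work):
    # Memory is shared (all actors live in the same process here), but
    # compute is mutually exclusive: acquire the GPU before running.
    with gpu_lock:
        timeline.append(f"{actor_name}:start")
        work()
        timeline.append(f"{actor_name}:end")

threads = [
    threading.Thread(target=run_on_gpu, args=("B", lambda: None)),
    threading.Thread(target=run_on_gpu, args=("C", lambda: None)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Whichever actor starts first must finish before the other begins.
print(timeline)
```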
ah interesting. is it because the workloads you guys run requires less memory but very compute bound?
In the Online DPO workflow we have three models (called policy, reward, and reference), and their computation is sequential. Most of the computation comes from the policy model (it occupies 70-80% of E2E time), but the memory footprints of the three models are approximately equal. Obviously, having the reward and reference models each occupy a whole GPU is not economical due to the low GPU utilization.
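To make the utilization argument concrete, here is the arithmetic using the 70-80% figure above (the even split between reward and reference is an assumption for illustration):

```python
# Assume the policy model takes 75% of E2E step time and that the reward
# and reference models split the remainder evenly.
policy_share = 0.75
reward_share = reference_share = (1 - policy_share) / 2

# Dedicated GPUs: each model's card idles whenever the other models run.
dedicated_utilization = {
    "policy": policy_share,        # busy 75% of the step
    "reward": reward_share,        # busy 12.5% of the step
    "reference": reference_share,  # busy 12.5% of the step
}
print(dedicated_utilization)

# Colocated (EMS-style) on one GPU: the single card is busy all step long,
# since the three sequential phases run back-to-back on it.
colocated_utilization = policy_share + reward_share + reference_share
print(colocated_utilization)  # 1.0
```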
Description
A, B, C, and D are different actors, and all data between these four actors is transferred through NCCL channels.
There is no data dependency between B and C. However, B and C share the same GPU, and when an actor is using the GPU, it must utilize the entire GPU. Therefore, B and C cannot execute simultaneously.
This request comes from users who are trying to test ChatLearn with RayCG:
https://github.com/kevin85421/RayCG-ChatLearn/issues/3
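A runnable sketch of the requested semantics, with pure-Python functions standing in for the actors (NCCL transfers are elided; the lock models whole-GPU exclusivity on the GPU shared by B and C):

```python
import threading

gpu1_lock = threading.Lock()  # B and C share this GPU exclusively
order = []

def actor_a(x):
    order.append("A")
    return x + 1

def actor_b(x):
    with gpu1_lock:  # B must own the entire GPU while it runs
        order.append("B")
        return x * 2

def actor_c(x):
    with gpu1_lock:  # so B and C can never execute simultaneously
        order.append("C")
        return x * 3

def actor_d(b_out, c_out):
    order.append("D")
    return b_out + c_out

a_out = actor_a(1)
# B and C have no data dependency, so they may be launched in parallel...
results = {}
tb = threading.Thread(target=lambda: results.update(b=actor_b(a_out)))
tc = threading.Thread(target=lambda: results.update(c=actor_c(a_out)))
tb.start(); tc.start(); tb.join(); tc.join()
# ...but the lock guarantees they run back-to-back, not concurrently.
d_out = actor_d(results["b"], results["c"])
print(d_out)  # (1+1)*2 + (1+1)*3 = 10
```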
Use case
No response