hijkzzz opened 1 month ago
@njhill Do you have any insights? Thanks.
@hijkzzz I don't have any immediate insight. I can take a closer look but can't promise how soon.
We could also consider adding a flag to disable the behaviour introduced in #4894, in particular to have the remote worker "loop" always exit after a single iteration. There would be a performance downside to that but it may help with cases like yours.
actually, I'm quite surprised that it worked previously. vLLM should take control over all distributed initialization and destruction. How can you add another process into the group?
We hacked the `init_process_group` API and created a new group for the vLLM engines and rank 0 of DeepSpeed.
See here: https://github.com/OpenLLMAI/OpenRLHF/blob/188139f809d9d14a8b1d8210f9e6746e2422e4e0/openrlhf/utils/distributed_util.py#L20
and
https://github.com/OpenLLMAI/OpenRLHF/blob/188139f809d9d14a8b1d8210f9e6746e2422e4e0/openrlhf/trainer/ray/ppo_actor.py#L89
Thanks.
This is quite hacky. If possible, I suggest sharing CUDA tensors across processes, e.g. if vLLM has TP processes and your DeepSpeed process group also has TP processes, they can share CUDA tensors without copying them around. It requires that the two groups own the same set of tensors, though.
This cannot meet the requirements for multi-machine distributed training in RLHF.
Your current environment
We are working on accelerating RLHF algorithms and need to broadcast the weights of the DeepSpeed engine to the vLLM Ray worker. In v0.4.2, we were able to create an additional NCCL group to achieve this. However, after updating to v0.4.3 and incorporating the changes from this MR, we found that doing so causes NCCL errors during broadcast.
Our weight synchronization code is located at https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/vllm_engine.py and https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/vllm_worker_wrap.py
See `init_process_group` (builds an NCCL group between vLLM and DeepSpeed, named `self._model_update_group`) and `update_weight` (broadcasts weights from DeepSpeed to vLLM via `torch.distributed.broadcast(weight, 0, group=self._model_update_group)`). We temporarily replaced the NCCL backend with GLOO to make it work, but the performance was poor.
The error message is:
Even if we call `self.llm.llm_engine.model_executor.stop_remote_worker_execution_loop()` before the broadcast, there is still another NCCL error. I think our call to `torch.distributed.broadcast(weight, 0, group=self._model_update_group)` may conflict with this MR, but I'm not sure how to fix it.