
[REQUEST] More fine-grained distributed strategies for RLHF training #884

Open · youshaox opened this issue 6 months ago

youshaox commented 6 months ago

Is your feature request related to a problem? Please describe.
We find that the generation stage of the RLHF pipeline is time-consuming in the current training process. This is because the four models (Actor, Critic, Reward, and Reference) are all colocated on the same devices, following a "Flattening" strategy. As a result, the training and inference runtimes are mixed in the current procedure, which rules out training- or inference-specific optimizations. In addition, a significant amount of memory is occupied by models that sit idle during the actor's generation stage (a rough illustration follows below). Therefore, instead of colocating these four models on all devices, a more fine-grained placement strategy could be used.
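For illustration only, here is a minimal sketch of why the colocated "Flattening" layout wastes memory during generation. The model sizes and the bytes-per-parameter figure are made-up assumptions, not measurements from DeepSpeed-Chat:

```python
# Hypothetical sketch: estimate the per-device memory that sits idle while the
# actor generates. Model sizes and BYTES_PER_PARAM are illustrative numbers,
# not measurements from DeepSpeed-Chat.
PARAMS_BILLIONS = {"actor": 13, "critic": 7, "reward": 7, "reference": 13}
BYTES_PER_PARAM = 2  # fp16/bf16 weights only, ignoring optimizer and KV-cache state


def idle_memory_gb() -> float:
    """GB per device held by models that do no work during the generation stage."""
    # Under "Flattening", all four models are colocated on every device,
    # but only the actor is active while generating responses.
    idle_models = [m for m in PARAMS_BILLIONS if m != "actor"]
    return sum(PARAMS_BILLIONS[m] for m in idle_models) * BYTES_PER_PARAM


print(f"idle during generation: ~{idle_memory_gb():.0f} GB per device")
```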

Describe the solution you'd like
Our team is planning to open-source our implementation of APP (https://arxiv.org/pdf/2312.11819.pdf) and contribute it to the codebase. Specifically, we are proposing two fine-grained model placement strategies (a sketch of both placements follows this list):

- A Separation strategy that separates the training and inference runtimes of the RLHF pipeline by introducing additional shadow models. This allows inference-optimized techniques such as vLLM and intra-node tensor parallelism to accelerate the time-consuming generation stage, and it enables the generation stage to use different distributed strategies than the training stage.
- An Interleaving strategy that reduces memory redundancy and communication costs in RLHF training by placing models without mutual dependencies on exclusive devices, with careful orchestration. For example, inference-only models such as the reward model and the reference model could be placed on separate devices. This reduces memory redundancy under DDP or ZeRO stages 1-2 by shrinking the set of participating nodes.
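As a rough sketch of how the two placements could be expressed, assuming hypothetical names (`assign_placement`, `actor_shadow`) and an even 50/50 device split that are not part of the APP paper's or DeepSpeed-Chat's actual interface:

```python
# Hypothetical sketch of the two proposed placement strategies. The function
# name, role names, and the even device split are illustrative assumptions.
def assign_placement(world_size: int, strategy: str) -> dict:
    """Return {model_name: ranks} describing which devices host which model."""
    ranks = list(range(world_size))
    half = world_size // 2
    if strategy == "interleaving":
        # Models without mutual dependencies live on disjoint devices, so each
        # DDP/ZeRO data-parallel group is smaller and holds less redundant state.
        return {
            "actor": ranks[:half],
            "critic": ranks[:half],
            "reward": ranks[half:],
            "reference": ranks[half:],
        }
    if strategy == "separation":
        # Training runtime (actor/critic updates) and inference runtime
        # (generation with a shadow actor, plus reward/reference scoring) run
        # on disjoint devices; the inference side is then free to use an
        # engine such as vLLM with intra-node tensor parallelism.
        return {
            "actor": ranks[:half],
            "critic": ranks[:half],
            "actor_shadow": ranks[half:],
            "reward": ranks[half:],
            "reference": ranks[half:],
        }
    raise ValueError(f"unknown strategy: {strategy}")


print(assign_placement(8, "separation"))
```

The sketch only captures the device assignment; under the separation strategy the shadow actor's weights would additionally need to be synchronized from the training actor before each generation round.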

Describe alternatives you've considered
N/A

Additional context
Thank you for sharing deepspeed-chat with the community! It has become an essential piece of infrastructure, providing an easy-to-use solution for training InstructGPT-like models. Recently, we have made some improvements to further enhance training performance while maintaining the simplicity of usage. These improvements have already been deployed in RLHF training at Ant Group. In order to share our efforts with the deepspeed-chat community, we would like to integrate our implementation into the DeepSpeedExamples codebase.

To facilitate discussions and minimize potential conflicts of interest, we have created this issue to engage in conversations about the proposed modifications. We look forward to collaborating with the community on this matter.

Please feel free to comment here or reach out via email (youshao.xys@antgroup.com). Thanks!