[REQUEST] Upstream modifications of PaRO
Is your feature request related to a problem? Please describe.
We find that the current distributed strategies of ZeRO are limited in heterogeneous networks, where intra-node and inter-node bandwidths differ. This performance gap can significantly slow down the collective communication operations used in ZeRO. Although ZeRO++ and MiCS further optimize communication cost, these solutions remain constrained: different distributed strategies are needed when the set of trainable parameters differs. For example, in PEFT scenarios it is more advisable to partition parameters at a finer granularity than gradients, since parameters occupy most of the memory footprint. A strategy that partitions parameters within groups, leaves gradients unpartitioned, and partitions optimizer states globally is therefore considerably more efficient than the ZeRO or MiCS strategies in such cases.
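To make the PEFT example concrete, here is a rough back-of-the-envelope sketch (our own simplification for illustration, not code from PaRO) of per-GPU model-state memory under a three-letter strategy, where the letters give the partitioning level for parameters, gradients, and optimizer states. It assumes fp16 parameters and gradients (2 bytes each), Adam optimizer states (12 bytes per trainable parameter), and that gradients and optimizer states exist only for trainable parameters:

```python
def per_gpu_gb(total_params, trainable_params, strategy, group_size, world_size):
    """Rough per-GPU model-state memory (GB) for a 3-letter strategy.

    The strategy letters apply to (parameters, gradients, optimizer states):
      N = no partitioning, I = intra-group, G = global.
    """
    shard = {"N": 1, "I": group_size, "G": world_size}
    p_deg, g_deg, os_deg = (shard[s] for s in strategy)
    total_bytes = (2 * total_params / p_deg          # fp16 parameters
                   + 2 * trainable_params / g_deg    # fp16 gradients
                   + 12 * trainable_params / os_deg) # Adam states (fp32)
    return total_bytes / 2**30

# PEFT on a 7B model with ~0.1% trainable params, 8 GPUs per group, 64 GPUs total:
zero1_like = per_gpu_gb(7e9, 7e6, "NNG", 8, 64)  # ZeRO-1-like: params unsharded
paro_ing   = per_gpu_gb(7e9, 7e6, "ING", 8, 64)  # group-partition params only
```

Under these assumptions, the ING-style strategy shards the dominant parameter memory within a group while keeping gradients local, which is the intuition behind the PEFT-oriented strategies described above.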
Describe the solution you'd like
Our team is planning to open-source our implementation of PaRO (https://arxiv.org/pdf/2310.06003) and contribute it to the codebase. Specifically, we define three partitioning levels, from coarse to fine: no partitioning (N), intra-group partitioning (I), and global partitioning (G). Each level can be applied to each of the three components of model states: parameters (p), gradients (g), and optimizer states (os). This yields 27 combinations of model partitioning strategies, though not all of them are effective. Through rigorous analysis, we have identified 14 effective strategies, referred to as PaRO-DP. These strategies optimize the trade-off between memory and communication cost across diverse training scenarios, including full-parameter training, partial-parameter training, and PEFT. The PaRO-DP strategies can improve training speed by up to 266% over ZeRO in environments with poor inter-group communication. We showcase part of the experimental results in the figures below, with detailed illustrations provided in the paper. Notably, ZeRO and MiCS can be considered special cases within the PaRO-DP strategy set: ZeRO-1 corresponds to PaRO-NNG and MiCS corresponds to PaRO-III. Additionally, to our knowledge, a distributed training strategy tailored for PEFT is a new attempt in the field.
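The 27-way combination space is easy to enumerate; the short sketch below (ours, for illustration only) builds the three-letter strategy names and records the two correspondences noted above:

```python
from itertools import product

LEVELS = "NIG"  # N = no, I = intra-group, G = global partitioning

# One letter each for parameters, gradients, and optimizer states,
# from coarse to fine: 3**3 = 27 candidate strategies.
strategies = ["".join(combo) for combo in product(LEVELS, repeat=3)]

# Special cases noted in the text:
known = {"NNG": "ZeRO-1", "III": "MiCS"}
```

The paper then prunes this space down to the 14 effective PaRO-DP strategies; the pruning analysis itself is detailed there.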
Describe alternatives you've considered
N/A
Additional context
Thanks for sharing ZeRO with the community! It has advanced the development of distributed training and become essential infrastructure for LLM training, requiring only minimal code modifications. Recently, we have made some improvements that further enhance training performance while maintaining this simplicity of usage. These improvements have already been deployed for LLM training at Ant Group. To share our efforts with the community, we look forward to integrating our implementation into the DeepSpeed master branch.
We have created this issue to facilitate discussion of the proposed modifications and to minimize potential conflicts. We are excited to work with the community on this.
We list the tentative code changes for discussion:
- We plan to place most of the PaRO-specific implementation into `deepspeed/runtime/zero/paro_*.py` files. Our implementations will inherit from the classes `DeepSpeedZeroOptimizer` and `DeepSpeedZeroOptimizer_Stage3` to reuse existing code logic.
- Minimal modifications will be required in `deepspeed/runtime/engine.py` to include the above optimizers.
- Our implementation will add new user interfaces to the `ds_config.json` file.

Please feel free to comment here or reach out via email (youshao.xys@antgroup.com).
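As a purely illustrative sketch of the kind of options the `ds_config.json` change would add (the key names below are placeholders of our own for discussion, not the actual proposed interface), it might look something like:

```json
{
  "zero_optimization": {
    "paro": {
      "strategy": "ING",
      "group_size": 8
    }
  }
}
```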
Many Thanks, Kris Xiao