microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[checkpoint conversion] meg-ds to meg-ds topology reshaping #23

Open stas00 opened 2 years ago

stas00 commented 2 years ago

Feature request

Similar to https://github.com/microsoft/Megatron-DeepSpeed/tree/main/tools/convert_checkpoint

deepspeed_to_megatron.py --target_tp TARGET_TP --target_pp TARGET_PP [...]

where a checkpoint can be reshaped to a different TP/PP target when converting from Megatron-DeepSpeed to Megatron-LM, we need the same for Megatron-DeepSpeed to Megatron-DeepSpeed. That is, currently it is not possible to change the TP topology once training has started.
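Conceptually, reshaping tensor-parallel weights means merging the existing shards along their partition axis and re-splitting them into the target number of pieces. Here is a minimal numpy sketch of that idea; the function name and the fixed single axis are illustrative assumptions, since real Megatron checkpoints partition different weights along different axes (column- vs. row-parallel linear layers) and store them in framework-specific state dicts:

```python
import numpy as np

def reshape_tp(shards, target_tp, axis=0):
    """Re-partition tensor-parallel shards to a new TP degree.

    Merge the existing shards along the partition axis, then
    re-split into ``target_tp`` equal pieces. Illustrative only:
    it assumes every weight is split along one known axis and
    that the merged dimension divides evenly by ``target_tp``.
    """
    full = np.concatenate(shards, axis=axis)
    return np.split(full, target_tp, axis=axis)

# Example: go from TP=4 to TP=2 for an (8, 4) weight split on axis 0.
weight = np.arange(32.0).reshape(8, 4)
tp4_shards = np.split(weight, 4, axis=0)       # four (2, 4) shards
tp2_shards = reshape_tp(tp4_shards, target_tp=2)  # two (4, 4) shards
```

The pipeline-parallel dimension is simpler in this respect, since PP partitions whole layers rather than slicing individual tensors, so reshaping PP is mostly a matter of reassigning layers to stages.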

So the desired API is:

deepspeed_to_deepspeed.py --target_tp TARGET_TP --target_pp TARGET_PP [...]

Critical new need: the optimizer states need to be reshaped as well

Thank you!

@tjruwase

Vincentwei1021 commented 1 year ago

It seems that https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/238 completed this work. When will it be backported to this repo?