Open LLMChild opened 1 month ago
In my case, another problem arises when I specify --untie-embeddings-and-output-weights in the script. The whole program gets stuck in an NCCL all-gather operation. Surprisingly, it gets stuck at a random iteration, making reproduction quite difficult. If you encounter the same situation, try modifying the code in language_model.py to forcefully disable tensor parallel (TP) linear.
Setup Environment
First, make sure that everything works in
https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama
This confirms that you have resolved all environment issues and can start converting the Hugging Face checkpoint into a ZeRO-enabled checkpoint.
Checkpoint Conversion
The simplest idea is to use the script hf2megads_weight_converter.py with pipeline parallelism disabled to get a DeepSpeed ZeRO checkpoint. However, this cannot be done with the script from
https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama
If you try, you will get an error; see https://github.com/microsoft/Megatron-DeepSpeed/blob/3afd267e1e50b1410beb606c5625cc232a55417a/tools/hf2megads_weight_converter.py#L288-L291
You might then expect the universal_checkpointing technique to handle such a conversion. Universal checkpointing can convert between ZeRO1/2/3 checkpoints with different world sizes, and between TP/PP/ZeRO1 checkpoints with different parallel sizes. But it cannot convert between TP/PP/ZeRO1 and ZeRO2/3. So there is only one way left: work out a ZeRO2/3 checkpoint conversion method based on the script hf2megads_weight_converter.py.
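To make the compatibility rule above easier to check against your own setup, here is a minimal sketch that encodes it as a predicate. The `Layout` class and the function `universal_ckpt_can_convert` are hypothetical names invented for illustration; they are not part of DeepSpeed or Megatron-DeepSpeed, and the logic only restates the rule as described in this post rather than querying the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of how a checkpoint is sharded."""
    tp: int = 1          # tensor-parallel size
    pp: int = 1          # pipeline-parallel size
    zero_stage: int = 0  # DeepSpeed ZeRO stage (0 to 3)

    @property
    def uses_model_parallel(self) -> bool:
        # True when the checkpoint is split by TP and/or PP.
        return self.tp > 1 or self.pp > 1


def universal_ckpt_can_convert(src: Layout, dst: Layout) -> bool:
    """Restate the rule from this post (an assumption, not an official check).

    Conversion is rejected whenever one side is sharded with TP/PP and the
    other side is a ZeRO2/3 checkpoint; everything else is allowed.
    """
    for a, b in ((src, dst), (dst, src)):
        if a.uses_model_parallel and b.zero_stage >= 2:
            return False
    return True


if __name__ == "__main__":
    # TP/PP/ZeRO1 checkpoint to a pure ZeRO3 checkpoint: the case that fails.
    print(universal_ckpt_can_convert(Layout(tp=2, pp=2, zero_stage=1),
                                     Layout(zero_stage=3)))
    # ZeRO2 to ZeRO3 with no model parallelism: the supported case.
    print(universal_ckpt_can_convert(Layout(zero_stage=2),
                                     Layout(zero_stage=3)))
```

The first call covers the situation in this post (a converter output sharded by TP/PP that you want as ZeRO2/3), which is why a dedicated conversion path based on hf2megads_weight_converter.py is needed.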
Finetune script
After getting a ZeRO checkpoint, everything else is quite easy. But since the tutorial at
https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama
does not expect you to finetune llama using ZeRO without pipeline parallelism, a little more effort is still needed. For the detailed modifications, please refer to fix-zero-load; with those changes it should work well.
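For reference, a ZeRO run without pipeline parallelism also needs a DeepSpeed config that selects the ZeRO stage. The sketch below builds such a config as a Python dict and prints it as JSON. The helper name `build_zero_config` and all numeric values are illustrative assumptions, and the sketch does not reproduce the fix-zero-load changes. It assumes no tensor or pipeline parallelism, so the data-parallel world size equals the number of GPUs.

```python
import json


def build_zero_config(micro_batch_size: int,
                      grad_accum_steps: int,
                      dp_world_size: int,
                      zero_stage: int = 3) -> dict:
    """Build a minimal DeepSpeed config dict (hypothetical helper)."""
    if zero_stage not in (1, 2, 3):
        raise ValueError("zero_stage must be 1, 2, or 3")
    return {
        # DeepSpeed requires:
        # train_batch_size = micro batch * grad accumulation * data-parallel size
        "train_batch_size": micro_batch_size * grad_accum_steps * dp_world_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": zero_stage},
        "bf16": {"enabled": True},  # assumption: bf16 training
        "steps_per_print": 10,      # illustrative value
    }


if __name__ == "__main__":
    # Example values only: 8 GPUs, micro batch 2, 4 accumulation steps.
    cfg = build_zero_config(micro_batch_size=2, grad_accum_steps=4,
                            dp_world_size=8, zero_stage=3)
    print(json.dumps(cfg, indent=2))
```

Deriving `train_batch_size` from the other three values keeps the config consistent, since DeepSpeed rejects configs in which the batch-size fields disagree.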