microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

A tutorial to help you finetune Llama-2-7b using this repository full of garbage code with ZeRO2/3 enabled. #430

Open LLMChild opened 1 month ago

LLMChild commented 1 month ago

Setup Environment

First, make sure that everything works well in [link]. This ensures that you have solved all environment issues and can start converting the Hugging Face checkpoint into a ZeRO-enabled checkpoint.

Checkpoint Conversion

The simplest idea is to use the script [link] with pipeline parallelism disabled to get a DeepSpeed ZeRO checkpoint. Ah! But that cannot be done with this script of [link]. If you try it, you will get an error.

Then you may think the universal_checkpointing technique can help you with such a conversion. Ah! You would hope universal_checkpointing can convert between ZeRO1/2/3 checkpoints with different world sizes and between TP/PP/ZeRO1 checkpoints with different parallel sizes. But it cannot convert between TP/PP/ZeRO1 and ZeRO2/3. So there is only one way left: figure out a ZeRO2/3 checkpoint conversion method based on this script [link].
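To make the obstacle concrete, here is a toy sketch of why shards written for one layout do not map directly onto another. It assumes a simplified picture in which ZeRO flattens the parameters and splits the flat buffer evenly across data-parallel ranks. The helper names `partition_flat` and `merge_partitions` are hypothetical, and this is not DeepSpeed's actual on-disk format.

```python
def partition_flat(flat, world_size):
    """Split a flat parameter list into world_size equal shards, zero-padding the tail."""
    shard_len = -(-len(flat) // world_size)  # ceiling division
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def merge_partitions(shards, numel):
    """Concatenate shards and drop the padding to recover the flat parameters."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]


# Hypothetical flat buffer of 10 parameter values.
params = [float(i) for i in range(10)]

shards_4 = partition_flat(params, world_size=4)      # layout for 4 ranks
restored = merge_partitions(shards_4, len(params))   # back to one flat buffer
shards_3 = partition_flat(restored, world_size=3)    # re-shard for 3 ranks
```

Changing the world size already forces a merge followed by a re-split. Starting from a TP/PP checkpoint would additionally require merging tensor-parallel slices and pipeline stages before a step like `partition_flat`, which is presumably the part a conversion script has to implement itself.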

Finetune script

After getting a ZeRO checkpoint, everything else is quite easy. But since this tutorial does not expect you to finetune Llama using ZeRO without pipeline parallelism, it still takes a little effort to get there.

For the detailed modifications, please refer to this fix-zero-load; with them it should work well.
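For orientation, a ZeRO-2 DeepSpeed config might look like the sketch below. The keys `zero_optimization.stage`, `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps` and `bf16.enabled` are standard DeepSpeed config fields. The values and the file name `ds_config_zero2.json` are assumptions for illustration and are not taken from the fix-zero-load changes.

```python
import json

# Minimal DeepSpeed config sketch; all numeric values are assumed placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumed value
    "gradient_accumulation_steps": 8,      # assumed value
    "zero_optimization": {"stage": 2},     # use 3 for ZeRO-3
    "bf16": {"enabled": True},
}

# Hypothetical file name; point your launch script at wherever you save it.
with open("ds_config_zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting file would then be handed to the training script through DeepSpeed's `--deepspeed_config` argument. Whether further flags are needed to turn off pipeline parallelism depends on the finetune script you start from.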

LLMChild commented 1 month ago

In my case, another problem arises when I specify --untie-embeddings-and-output-weights in the script. The whole program gets stuck in an NCCL all-gather operation. Surprisingly, it gets stuck at a random iteration, which makes it quite difficult to reproduce. If you encounter the same situation, try modifying the code in [link] to forcefully disable the tensor parallel (TP) linear layer.
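Since the hang shows up at a random iteration, it can help to record where each rank is stuck when it happens. The sketch below uses Python's standard `faulthandler` module to dump all thread stacks to a log file if a single step takes longer than a timeout. The wrapper `run_with_watchdog`, the `step_fn` callback, the timeout value and the log path are hypothetical and not part of this repository. It only helps locate the hang; it does not fix it.

```python
import faulthandler


def run_with_watchdog(step_fn, num_iters, log_path, timeout_s=600.0):
    """Call step_fn(it) num_iters times; if one call exceeds timeout_s,
    dump the stacks of all threads to log_path (the call keeps running)."""
    with open(log_path, "w") as log:
        for it in range(num_iters):
            # Arm the watchdog for this iteration.
            faulthandler.dump_traceback_later(timeout_s, file=log)
            try:
                step_fn(it)
            finally:
                # Disarm it once the iteration finishes (or raises).
                faulthandler.cancel_dump_traceback_later()
    return num_iters


# Usage with a stand-in step function; a real one would run one training iteration.
seen = []
done = run_with_watchdog(seen.append, num_iters=3, log_path="watchdog_rank0.log")
```

Writing one log file per rank (for example by putting the rank into the file name) makes it easier to compare which ranks entered the all-gather and which did not.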
