microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

A tutorial to help you finetune Llama-2-7b using this repository full of garbage code with ZeRO2/3 enabled. #430

Open LLMChild opened 1 month ago

LLMChild commented 1 month ago

Setup Environment

First, make sure that everything works in https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama. This confirms that you have resolved all environment issues and can start converting the Hugging Face checkpoint into a ZeRO-enabled checkpoint.
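Before touching any checkpoints, a quick sanity check of the stack can save time. The snippet below is my own addition, not part of the example; it only verifies that the basic pieces are importable and that an NCCL-capable setup is visible.

```python
# Minimal environment sanity check before attempting checkpoint conversion.
import torch
import deepspeed
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("deepspeed:", deepspeed.__version__)
print("transformers:", transformers.__version__)

# ZeRO's collectives run on NCCL, so make sure the backend is built into torch.
print("NCCL available:", torch.distributed.is_nccl_available())
```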

Checkpoint Conversion

The simplest idea is to use the script hf2megads_weight_converter.py with pipeline parallelism disabled to obtain a DeepSpeed ZeRO checkpoint. Unfortunately, this cannot be done with the script as used by https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama; if you try, you will hit the error raised here: https://github.com/microsoft/Megatron-DeepSpeed/blob/3afd267e1e50b1410beb606c5625cc232a55417a/tools/hf2megads_weight_converter.py#L288-L291

You may then think the universal_checkpointing technique can help with this conversion. You might wish it could convert between ZeRO1/2/3 checkpoints with different world sizes and TP/PP/ZeRO1 checkpoints with different parallel sizes, but it cannot convert between TP/PP/ZeRO1 and ZeRO2/3. So there is only one way left: figure out how to build a ZeRO2/3 checkpoint conversion method on top of hf2megads_weight_converter.py.
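To make that route concrete, here is a minimal sketch of the conversion idea, not the actual hf2megads_weight_converter.py logic. It assumes you have already built the Megatron-DeepSpeed model without pipeline parallelism, that ds_config is the same training-style DeepSpeed config (optimizer section plus the ZeRO stage) you plan to finetune with, and that you can supply a name_map callable from Hugging Face parameter names to Megatron ones; all three are placeholders here.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM


def convert_hf_to_zero(hf_path, megads_model, ds_config, save_dir, name_map):
    """Copy HF Llama weights into a Megatron-DeepSpeed model and save them
    through the DeepSpeed engine, so the result on disk is a ZeRO checkpoint."""
    # Wrap the model with the same ZeRO config used for finetuning, so that
    # save_checkpoint() below writes the partitioned ZeRO layout directly.
    engine, _, _, _ = deepspeed.initialize(
        model=megads_model,
        model_parameters=megads_model.parameters(),
        config=ds_config,
    )

    # Load the Hugging Face reference weights on CPU to keep GPU memory free.
    hf_state = AutoModelForCausalLM.from_pretrained(
        hf_path, torch_dtype=torch.bfloat16
    ).state_dict()

    # Copy each HF tensor into the matching Megatron parameter. For Llama this
    # also needs QKV and gate/up fusion, omitted here for brevity. Under ZeRO-3
    # the copies must additionally run inside deepspeed.zero.GatheredParameters.
    megads_params = dict(engine.module.named_parameters())
    with torch.no_grad():
        for hf_name, tensor in hf_state.items():
            target = megads_params[name_map(hf_name)]
            target.copy_(tensor.to(dtype=target.dtype, device=target.device))

    # Writing through the engine produces the ZeRO-2/3 checkpoint layout.
    engine.save_checkpoint(save_dir, tag="hf_converted")
```

The key point is that the weights leave through engine.save_checkpoint(), so the finetune job can later resume from them like any other ZeRO checkpoint.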

Finetune script

After getting a ZeRO checkpoint, everything else is quite easy. But since the tutorial at https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama does not expect you to finetune Llama with ZeRO and without pipeline parallelism, a little extra effort is still needed.
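For reference, a DeepSpeed config in the spirit of this setup (ZeRO with no pipeline parallelism) might look like the sketch below. Every number is a placeholder to adapt to your hardware; only the key names are standard DeepSpeed options.

```python
# Illustrative DeepSpeed config for ZeRO-2 finetuning without pipeline parallelism.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5, "weight_decay": 0.0}},
    "zero_optimization": {
        "stage": 2,                   # switch to 3 for ZeRO-3
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "steps_per_print": 10,
}
```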

For the detailed modifications, please refer to fix-zero-load; with those changes it should work well.
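The exact diff is in the linked fix-zero-load changes and is not reproduced here. Purely as a rough sketch of the load path such a fix is concerned with (my assumption, not the contents of that branch), the converted checkpoint can be pulled into the training engine without optimizer state like this:

```python
# Hypothetical snippet: resume finetuning from the converted ZeRO checkpoint.
# `engine` is the DeepSpeed engine built for training, and the tag matches the
# one assumed in the conversion sketch above.
load_path, client_state = engine.load_checkpoint(
    "/path/to/converted_ckpt",       # placeholder directory
    tag="hf_converted",
    load_optimizer_states=False,     # start finetuning with a fresh optimizer
    load_lr_scheduler_states=False,
)
assert load_path is not None, "DeepSpeed could not find the converted checkpoint"
```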

LLMChild commented 1 month ago

In my case, another problem arises when I specify --untie-embeddings-and-output-weights in the script. The whole program gets stuck in an NCCL all-gather operation. Surprisingly, it gets stuck at a random iteration, making reproduction quite difficult. If you encounter the same situation, try modifying the code in language_model.py to forcibly disable the tensor-parallel (TP) linear layer.

(screenshot: the modified code in language_model.py)
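Since the screenshot does not carry over here, the sketch below shows one possible shape of that kind of workaround (my reconstruction of the idea, not necessarily the exact change in the image): when tensor parallelism is effectively unused, build the output projection as a plain torch.nn.Linear so its forward involves no NCCL collectives.

```python
import torch.nn as nn


def build_output_layer(hidden_size, vocab_size, use_tp_linear=False, tp_linear_cls=None):
    """Illustrative helper: force a plain nn.Linear when the TP linear is disabled.

    `tp_linear_cls` is a stand-in for Megatron's ColumnParallelLinear (the call
    signature here is hypothetical); leaving it unset, or passing
    use_tp_linear=False, takes the non-parallel path, which sidesteps the
    all-gather that the hang above was observed in.
    """
    if use_tp_linear and tp_linear_cls is not None:
        return tp_linear_cls(hidden_size, vocab_size, bias=False)
    return nn.Linear(hidden_size, vocab_size, bias=False)
```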