meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A. Also supports a number of inference solutions, such as HF TGI and vLLM, for local or cloud deployment, plus demo apps showcasing Meta Llama for WhatsApp & Messenger.

DeepSpeed support for full fine-tuning - FSDP performance is not as good as DeepSpeed #536

Open waterluck opened 4 months ago

waterluck commented 4 months ago

🚀 The feature, motivation and pitch

I used the current code to full fine-tune Llama 2 with FSDP. Training is very fast, but the resulting performance is even worse than LoRA fine-tuned models trained with DeepSpeed, and Llama 2 13B is even worse than 7B, which is very strange. Sadly, my coding ability is not enough to add DeepSpeed support myself.

More information

Even with a hyperparameter search over batch size and number of training epochs, and with other LR schedulers, FSDP's eval loss is worse than the DeepSpeed fine-tuned model's. On the test datasets, the performance differs by about 5 points on Llama 2 7B, and by more on Llama 2 13B, and I don't know the reason.
The DeepSpeed code I use is based on the Hugging Face run_no_trainer example; it runs fine for 7B, but it does not work for Llama 2 13B, and it is very slow.
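
To make the request concrete, here is a rough sketch of what DeepSpeed full fine-tuning could look like, using DeepSpeed's `deepspeed.initialize` API with a ZeRO-3 config. The model name, config values, and `train_dataloader` are placeholders for illustration, not settings from this repo:

```python
# Hypothetical DeepSpeed full fine-tuning loop (launched via `deepspeed train.py`).
# All config values below are illustrative assumptions, not tuned settings.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},          # bf16 compute; optimizer keeps fp32 master weights
    "zero_optimization": {"stage": 3},  # ZeRO-3: shard params, grads, and optimizer states
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# deepspeed.initialize wraps the model in an engine that handles sharding,
# mixed precision, gradient accumulation, and the optimizer step.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for batch in train_dataloader:  # train_dataloader is assumed to exist
    batch = {k: v.to(engine.device) for k, v in batch.items()}
    loss = engine(**batch).loss
    engine.backward(loss)  # handles loss scaling and accumulation boundaries
    engine.step()          # optimizer step + gradient zeroing
```

One detail worth noting: under ZeRO, DeepSpeed keeps fp32 master copies of the weights in the optimizer even when computing in bf16, which is one plausible explanation for the quality gap if the FSDP run trains in pure bf16.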

I found a similar issue discussing FSDP performing worse than DeepSpeed; it seems to be caused by how the model is loaded. Please take a look: https://github.com/huggingface/trl/issues/1224
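
From what I understand of that issue, the gap comes down to master-weight precision: if the checkpoint is loaded directly in bf16 before FSDP wrapping, the optimizer updates bf16 weights, whereas DeepSpeed keeps fp32 master copies. A rough sketch of what that would mean in code (my own reconstruction; the dtype details are an assumption, not confirmed from that thread), assuming a Llama 2 7B checkpoint and a process group already set up by `torchrun`:

```python
# Keep fp32 master weights; let FSDP cast to bf16 only for compute/communication.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

# Load in fp32 (the default), NOT torch_dtype=torch.bfloat16, so the parameters
# FSDP shards and the optimizer updates stay in full precision.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # forward/backward compute in bf16
    reduce_dtype=torch.bfloat16,  # gradient all-reduce in bf16
    buffer_dtype=torch.bfloat16,
)

# Simplified; real code would also pass an auto_wrap_policy and device placement.
model = FSDP(model, mixed_precision=bf16_policy)
```

If I read the llama-recipes configs right, this corresponds to running with mixed precision enabled rather than `pure_bf16`.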

Smu-Tan commented 1 month ago

@waterluck Not sure if it helps, but it's probably worth checking this.

waterluck commented 1 month ago

@Smu-Tan Thanks for sharing the info! It does make sense.

cailun01 commented 1 week ago

@waterluck My results are the same as yours. With full fine-tuning, FSDP is as fast as DeepSpeed, but with PEFT methods such as LoRA or QLoRA, FSDP is slower than DeepSpeed.