[Open] waterluck opened this issue 5 months ago
🚀 The feature, motivation and pitch
I used the current code to full fine-tune Llama2 with FSDP. Training is very fast, but the resulting performance is even worse than LoRA fine-tuned models trained with DeepSpeed, and Llama2-13B is even worse than 7B, which is very strange. Sadly, my coding ability is not enough for me to debug this myself.
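For reference, the setup I mean looks roughly like this. It is a minimal sketch assuming accelerate's FSDP plugin; the model name and settings are illustrative, not my exact script:

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)
from transformers import AutoModelForCausalLM

# Gather full (unsharded) model/optimizer state dicts on rank 0 when saving.
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # placeholder lr
model, optimizer = accelerator.prepare(model, optimizer)
```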
More information
Even with a hyper-parameter search over batch size and number of training epochs, and with other lr schedulers, FSDP's eval loss stays worse than what DeepSpeed fine-tuning reaches; on the test datasets, performance differs by about 5 points on Llama2-7B, and by more on Llama2-13B. I don't know the reason.
The DeepSpeed code I use is based on the Hugging Face `run_no_trainer` example; it runs fine for 7B, but it does not work for Llama2-13B, and it is very slow.
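In case it helps reproduce: a ZeRO-3 config with CPU offload is the usual way to make 13B fit where 7B already runs, and offloading is also a common cause of slowness. Below is a sketch of such a config; the values are placeholders, not my actual settings:

```python
# Sketch of a DeepSpeed ZeRO-3 config with CPU offload (placeholder values).
# Offloading optimizer and parameter states trades speed for memory, which
# can let a 13B model train where a 7B-only setup runs out of memory.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}
```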
I found a similar issue, linked below, which discusses FSDP performing worse than DeepSpeed; it seems to be caused by how the model is loaded. Please take a look: https://github.com/huggingface/trl/issues/1224
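If the model load is really the cause, the pattern discussed there amounts to roughly the following (a minimal sketch based on my reading of that issue, not verified code): load the weights in fp32 and let FSDP's `MixedPrecision` policy handle bf16 compute, rather than loading the model directly in bf16.

```python
import torch
from torch.distributed.fsdp import MixedPrecision
from transformers import AutoModelForCausalLM

# Suspect: loading with torch_dtype=torch.bfloat16 makes FSDP's flat
# parameters (and the optimizer state built on them) bf16. Loading in fp32
# keeps fp32 master weights, similar to what DeepSpeed ZeRO does internally.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # model name is illustrative
    torch_dtype=torch.float32,
)

# bf16 for compute and communication, fp32 master weights.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
# Pass bf16_policy as mixed_precision when wrapping the model with FSDP.
```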
@waterluck Your result is the same as mine. With full fine-tuning, FSDP is as quick as DeepSpeed, but with PEFT methods such as LoRA or QLoRA, FSDP is slower than DeepSpeed.

@Smu-Tan Thanks for sharing the info! It does make sense.