meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama3 for WhatsApp & Messenger.

Loss does not converge with FSDP cpu offloading #360

Open Ā· hjlee1371 opened this issue 7 months ago

hjlee1371 commented 7 months ago

System Info

pytorch==2.2.0, transformers==4.36.2, 8Ɨ A100 80GB GPUs

šŸ› Describe the bug

When FSDP CPU offloading is enabled with --fsdp_config.fsdp_cpu_offload, the loss fails to converge, unlike when offloading is disabled.

[Image: training loss curves with and without FSDP CPU offloading]

My experiments are based on a forked codebase with minimal modifications (https://github.com/hjlee1371/llama-recipes), using the following script. I fine-tuned llama-2-7b-hf on the (default) samsum dataset. This may be related to similar issues such as this.

# Remove --fsdp_config.fsdp_cpu_offload to train without offloading.
# --output_dir is the path where metrics are saved.
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py \
  --enable_fsdp \
  --fsdp_config.fsdp_cpu_offload \
  --model_name $MODEL_PATH \
  --save_metrics \
  --batch_size_training 4 \
  --gradient_accumulation_steps 16 \
  --batching_strategy padding \
  --flop_counter False \
  --profiler False \
  --output_dir model_logs \
  --dist_checkpoint_root_folder model_checkpoints \
  --dist_checkpoint_folder fine-tuned
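
For reference, here is a minimal sketch of how a CPU-offload flag like this typically maps onto PyTorch FSDP's wrapping API. It is illustrative only, not the llama-recipes source; the wrap_with_fsdp helper and its arguments are assumptions.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

def wrap_with_fsdp(model, use_cpu_offload: bool):
    # CPUOffload(offload_params=True) keeps sharded parameters (and their
    # gradients) on CPU between uses; this is what a flag such as
    # --fsdp_config.fsdp_cpu_offload toggles in spirit.
    cpu_offload = CPUOffload(offload_params=True) if use_cpu_offload else None
    return FSDP(
        model,
        cpu_offload=cpu_offload,
        device_id=torch.cuda.current_device(),
    )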

Error logs

No errors

Expected behavior

Training with and without CPU offloading should give the same results.

HamidShojanazeri commented 7 months ago

Thanks @hjlee1371 for bringing this to our attention. I believe I can somewhat reproduce your issue:

With CPU offload:
avg_train_prep: 1.0844144423802693
avg_train_loss: 0.08095582574605942

Without CPU offload:
avg_train_prep: 1.0553292433420818
avg_train_loss: 0.05351858213543892

I am checking with FSDP team on this and will keep you posted.
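
As a side note, if avg_train_prep is a perplexity computed as exp(loss) per epoch and then averaged (an assumption about how these metrics are produced), the reported pairs are roughly consistent with each other:

import math

# Assumption: perplexity is exp(loss) taken per epoch and then averaged, so
# exp(avg_loss) only approximates avg_prep (Jensen's inequality).
print(math.exp(0.08095582574605942))  # ~1.0843, vs. reported avg_train_prep 1.0844 (CPU offload)
print(math.exp(0.05351858213543892))  # ~1.0550, vs. reported avg_train_prep 1.0553 (no offload)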

stgzr commented 2 months ago

I suspect there is a bug in the gradient accumulation implementation. When the model is wrapped in a DistributedDataParallel module, calling backward averages the gradients across GPUs. So it looks like the gradients are being synchronized on every gradient accumulation step rather than only on the last one, which is a little odd.
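
For context, a common pattern to avoid reducing gradients on every accumulation micro-step is the no_sync() context manager exposed by both DistributedDataParallel and FSDP. The sketch below is illustrative and not the llama-recipes training loop; train_step, micro_batches, and the batch/loss interface are assumptions.

from contextlib import nullcontext

def train_step(model, optimizer, micro_batches):
    # Accumulate gradients locally and only all-reduce them on the final
    # micro-step; model is assumed to be wrapped in DDP or FSDP, both of
    # which provide no_sync().
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        is_last = i == len(micro_batches) - 1
        sync_ctx = nullcontext() if is_last else model.no_sync()
        with sync_ctx:
            loss = model(**batch).loss / len(micro_batches)
            loss.backward()
    optimizer.step()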