meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama3 for WhatsApp & Messenger.

Loss does not converge with FSDP cpu offloading #360

Open Ā· hjlee1371 opened this issue 7 months ago

hjlee1371 commented 7 months ago

System Info

pytorch==2.2.0, transformers==4.36.2, 8Ɨ A100 80GB GPUs

šŸ› Describe the bug

When FSDP CPU offloading is enabled with --fsdp_config.fsdp_cpu_offload, the loss fails to converge, unlike when offloading is disabled.

[Image: training loss curves with and without FSDP CPU offloading]

My experiments are based on a forked codebase with minimal modifications (https://github.com/hjlee1371/llama-recipes), using the following script. I fine-tuned llama-2-7b-hf on the (default) samsum dataset. This may be related to similar issues such as this.

# Remove --fsdp_config.fsdp_cpu_offload to train without offloading.
# --output_dir is the path where metrics are saved.
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py \
  --enable_fsdp \
  --fsdp_config.fsdp_cpu_offload \
  --model_name $MODEL_PATH \
  --save_metrics \
  --batch_size_training 4 \
  --gradient_accumulation_steps 16 \
  --batching_strategy padding \
  --flop_counter False \
  --profiler False \
  --output_dir model_logs \
  --dist_checkpoint_root_folder model_checkpoints \
  --dist_checkpoint_folder fine-tuned
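
For reference, here is a minimal sketch of how a CPU-offload flag like this typically maps onto PyTorch FSDP's wrapping API. It is illustrative only, not the llama-recipes source; the wrap_with_fsdp helper and its arguments are assumptions.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

def wrap_with_fsdp(model, use_cpu_offload: bool):
    # CPUOffload(offload_params=True) keeps sharded parameters (and their
    # gradients) on CPU between uses; this is what a flag such as
    # --fsdp_config.fsdp_cpu_offload toggles in spirit.
    cpu_offload = CPUOffload(offload_params=True) if use_cpu_offload else None
    return FSDP(
        model,
        cpu_offload=cpu_offload,
        device_id=torch.cuda.current_device(),
    )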

Error logs

No errors

Expected behavior

Training with and without CPU offloading should give the same results.

HamidShojanazeri commented 7 months ago

Thanks @hjlee1371 for bringing this to our attention. I believe I can somewhat reproduce your issue:

With CPU offload:
avg_train_prep: 1.0844144423802693
avg_train_loss: 0.08095582574605942

Without CPU offload:
avg_train_prep: 1.0553292433420818
avg_train_loss: 0.05351858213543892

I am checking with FSDP team on this and will keep you posted.
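
As a side note, if avg_train_prep is a perplexity computed as exp(loss) per epoch and then averaged (an assumption about how these metrics are produced), the reported pairs are roughly consistent with each other:

import math

# Assumption: perplexity is exp(loss) taken per epoch and then averaged, so
# exp(avg_loss) only approximates avg_prep (Jensen's inequality).
print(math.exp(0.08095582574605942))  # ~1.0843, vs. reported avg_train_prep 1.0844 (CPU offload)
print(math.exp(0.05351858213543892))  # ~1.0550, vs. reported avg_train_prep 1.0553 (no offload)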

stgzr commented 2 months ago

I suspect there is a bug in the gradient accumulation implementation. When the model is wrapped in a DistributedDataParallel module, calling backward averages the gradients across GPUs. So it looks like the gradients are being synchronized on every gradient accumulation step rather than only on the last one, which is a little odd.
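
For context, a common pattern to avoid reducing gradients on every accumulation micro-step is the no_sync() context manager exposed by both DistributedDataParallel and FSDP. The sketch below is illustrative and not the llama-recipes training loop; train_step, micro_batches, and the batch/loss interface are assumptions.

from contextlib import nullcontext

def train_step(model, optimizer, micro_batches):
    # Accumulate gradients locally and only all-reduce them on the final
    # micro-step; model is assumed to be wrapped in DDP or FSDP, both of
    # which provide no_sync().
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        is_last = i == len(micro_batches) - 1
        sync_ctx = nullcontext() if is_last else model.no_sync()
        with sync_ctx:
            loss = model(**batch).loss / len(micro_batches)
            loss.backward()
    optimizer.step()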