Open hjlee1371 opened 7 months ago
Thanks @hjlee1371 for bringing this to attention. I believe I can somewhat repro your issue.
CPU offload: avg_train_prep = 1.0844144423802693, avg_train_loss = 0.08095582574605942
Without CPU offload: avg_train_prep = 1.0553292433420818, avg_train_loss = 0.05351858213543892
I am checking with the FSDP team on this and will keep you posted.
I suspect there is a bug in the gradient accumulation implementation. If the model is wrapped in a DistributedDataParallel module, calling backward() averages the gradients across GPUs. So it looks like the gradients are being synchronized during every gradient accumulation step, which is a little weird.
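For context, here is a minimal sketch (not from the report or the llama-recipes code) of how gradient accumulation is usually combined with DDP so that the all-reduce only happens on the last micro-batch of each accumulation window; the distributed setup, model, and dataloader are assumed to exist on the caller's side:

```python
import contextlib
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: DDP, dataloader, accumulation_steps: int = 4):
    """Illustrative gradient accumulation loop with DDP.

    Uses no_sync() to skip the gradient all-reduce on intermediate
    accumulation steps; synchronization only happens on the final
    micro-batch before the optimizer step.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step, (inputs, targets) in enumerate(dataloader):
        is_sync_step = (step + 1) % accumulation_steps == 0
        # Defer the all-reduce unless this is the sync step.
        ctx = contextlib.nullcontext() if is_sync_step else model.no_sync()
        with ctx:
            loss = nn.functional.mse_loss(model(inputs), targets)
            (loss / accumulation_steps).backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
```

If an implementation instead synchronizes on every backward() call, each rank still ends up with averaged gradients, but the extra communication per accumulation step is wasted work and can interact badly with offloading.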
System Info
pytorch==2.2.0, transformers==4.36.2, 8x A100 80GB GPUs
Information
Describe the bug
When using FSDP CPU offload with --fsdp_config.fsdp_cpu_offload, the loss fails to converge, unlike when offloading is disabled. My experiments are based on a forked codebase with minimal modification (https://github.com/hjlee1371/llama-recipes), run through the following scripts. I finetuned llama-2-7b-hf with the (default) samsum dataset. This may be related to similar issues such as this one.
Error logs
No errors
Expected behavior
Training w/ and w/o CPU offloading should give the same results.
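For reference, a minimal sketch (my assumption, not the exact llama-recipes wiring) of how CPU offload is typically enabled when wrapping a model with PyTorch FSDP; the --fsdp_config.fsdp_cpu_offload flag presumably ends up toggling something like this:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

# Assumes torch.distributed is already initialized and `model` is constructed.
# offload_params=True keeps sharded parameters (and their gradients) on CPU,
# moving them to GPU only as needed during forward/backward.
def wrap_with_offload(model: torch.nn.Module, offload: bool) -> FSDP:
    return FSDP(
        model,
        cpu_offload=CPUOffload(offload_params=True) if offload else None,
        device_id=torch.cuda.current_device(),
    )
```

With this toggle, the only intended difference between the two runs is where parameters and gradients live, so training curves with and without offload should match up to numerical noise.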