cc @pacman100
From what I recall, if there is an overflow, the mini-batch is skipped and then the scaling factor is adjusted. Until the loss scaling stabilizes, the iteration times will vary. Here is an example from my training log: until the loss scaling is stable, the iteration time is lower (which I believe is because the mini-batches are being dropped).
On another note, I have seen wide variations between iterations when using DeepSpeed ZeRO-3 offload (mainly when it throws the memory-pressure warning).
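For reference, the dynamic loss scaling described above is driven by the `fp16` section of the DeepSpeed config. Here is a minimal sketch of the knobs that bound how far the scale can drop; the values are illustrative and not taken from this issue, and the `DeepSpeedPlugin` hand-off at the end is just one way to supply the dict:

```python
# Illustrative DeepSpeed fp16 settings (example values, not from this issue).
# "min_loss_scale" puts a floor under the dynamic loss scale, and a nonzero
# "loss_scale" disables dynamic scaling entirely in favor of a static scale.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # start at 2**16
        "loss_scale_window": 1000,  # overflow-free steps before the scale is raised again
        "hysteresis": 2,            # overflows tolerated before the scale is lowered
        "min_loss_scale": 1,        # never reduce the scale below this value
    },
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
}

# One way to hand the dict to Accelerate; the same "fp16" block can instead live in
# the JSON file referenced by `deepspeed_config_file` in the accelerate yaml.
accelerator = Accelerator(deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=ds_config))
```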
ok thank you
System Info
Information
Tasks
A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
I am running some fine-tuning via Accelerate + DeepSpeed. DeepSpeed progressively tries to decrease the time per iteration, which it does. This is great... until it reaches a certain point, after which the time per iteration starts increasing again. I wanted to know if there is a way to stop reducing the loss scale after a certain point (such as when the iteration time starts going back up), as well as the best way to go about using `get_accelerator().empty_cache()`.
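As a rough sketch of the pattern I am asking about (names like `model`, `optimizer`, `train_dataloader`, and `accelerator` are placeholders for the usual Accelerate training setup, not my exact code):

```python
# Rough sketch of calling empty_cache() every N steps instead of every step
# (placeholder names; the interval is arbitrary and would need tuning).
from deepspeed.accelerator import get_accelerator

EMPTY_CACHE_INTERVAL = 100  # hypothetical interval

for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

    # Freeing cached blocks occasionally can relieve memory pressure, but it makes
    # the allocator re-grow its cache afterwards, so calling it every step adds overhead.
    if step % EMPTY_CACHE_INTERVAL == 0:
        get_accelerator().empty_cache()
```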
Here is some of my code, followed by my accelerator YAML and how I run it.
ds.yaml
and here are the logs
Expected behavior
The iteration time should not increase as the loss scale keeps being reduced.