should dynamic scaling and overflow check happen only at the beginning?

stas00 commented 3 years ago

So fp16.initial_scale_power leads to dynamic scaling, except it probably should happen only until it found the right range and never check/go back to scaling again once the right scale has been found.

Observe this:

2021-04-06 21:22:36,418] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
  9% | 16/174 [01:20<13:49,  
{'loss': 3.2588, 'learning_rate': 0, 'epoch': 0.09}                                                                                                               
  9%| | 16/174 [01:20<13:49,  5.25s/it]

  [2021-04-06 21:22:40,973] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
 10%| | 17/174 [01:25<13:11,  
{'loss': 2.5342, 'learning_rate': 0, 'epoch': 0.1}                                                                                                                
{'loss': 3.0586, 'learning_rate': 0.0, 'epoch': 0.1}                                                                                                              
{'loss': 2.8711, 'learning_rate': 1.354634980487915e-06, 'epoch': 0.11}                                                                                           
{'loss': 2.875, 'learning_rate': 2.1470456462384806e-06, 'epoch': 0.11}                                                                                           
{'loss': 3.1064, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.12}     

# XXX: it resumed trying to scale here 2nd time:                                                                                       
 12%| | 21/174 [01:48<14:18,  5.61s/it]
 [2021-04-06 21:23:09,319] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}                                                                                               
 13%| | 22/174 [01:53<13:28,  5.32s/it]
 [2021-04-06 21:23:13,653] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}                                                                                               
 13%|

So the optimizer kicked in on step 17 as there was no more overflow, and then a few steps later the model overflowed because of a totally different reason (blfoat16-pretrained), but the overlow_clean_up kicks back on and tries to scale futher which is pointless since the model is done with - it never recovers.

I mean this doesn't make things worse, it's just confusing to the user that deepspeed is trying to recover from something it can't recover - and it's not deepspeed's fault either.

So my thinking that perhaps once a good scaling factor is reached the check can be stopped?

I hope I was able to convey the issue clearly.

stas00 commented 3 years ago

@samyam and I had a brief discussion about it, Samyam suggested that actually this check should continue through the whole training, because the rescaling might need to happen at a later stage as well, but it should be smart about detecting the point of no return. That is if the initial overflow has been overcome and the optimizer started stepping, then if the overlow is detected again, and it continues for several steps, then stop trying to scale down and logging message and instead just explain that the model is probably "kaput" and not do anything - then the user can clearly see that loss is nan and deepspeed won't be able to do anything about it.

ryusaeba commented 1 year ago

@stas00 Below is the OVERFLOW meesage I got. After first 6 steps is executed, we could see [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=6. I Does this means it is harmless if OVERFLOW message is gone after first few steps is executed? If so, I got antoher two OVERFLOW message in step=500. After that, the log is became [2023-04-28 06:59:46,878] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=8, where the number of skipped is 8 (=6+2). This might be also harmless, right?

One more question: Do you have any idea that the log I got doesn't show the loss?


Beginning of Epoch 1/16, Total Micro Batches 1840
[2023-04-28 06:51:29,903] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-28 06:51:30,648] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-04-28 06:51:31,392] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2023-04-28 06:51:32,150] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
[2023-04-28 06:51:32,906] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2023-04-28 06:51:35,672] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
[2023-04-28 06:51:37,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=6, lr=[9.649999560446815e-06, 9.649999560446815e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-28 06:51:37,676] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=36.46479187589155, CurrSamplesPerSec=32.71924863905173, MemAllocated=7.15GB, MaxMemAllocated=26.14GB

stas00 commented 1 year ago

When you hit an OVERFLOW under fp16 the optimizer skips this step, so yes it's harmless. Usually you get a few of those at the start while the program tries to find the best scaling factor which prevents gradients from being too small.

It then continues checking for overflows through the rest of the training and if it runs into one again it again adjusts it. This is part of the algorithm, i.e. harmless by your definition.

Do you have any idea that the log I got doesn't show the loss?

I imagine because the loss variable is on the user side, that is you need to print it yourself.

ryusaeba commented 1 year ago

Thanks for the explanation. I just realized it is related to "automatic loss scaling with mixed precision". Regarding the loss question, you are right.

microsoft / DeepSpeed

should dynamic scaling and overflow check happen only at the beginning? #931