Open 450586509 opened 9 months ago
ss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4, reducing to 2
[2024-01-17 10:10:24,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=16, lr=[0.0001], mom=[(0.9, 0.95)]
[2024-01-17 10:10:24,478] [INFO] [timer.py:260:stop] epoch=0/micro_step=64/global_step=16, RunningAvgSamplesPerSec=7.620608764200451, CurrSamplesPerSec=7.801349699356398, MemAllocated=13.44GB, MaxMemAllocated=14.82GB
0%| | 67/114599 [00:09<4:29:54, 7.07batch/s]
[2024-01-17 10:10:25,080] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
[2024-01-17 10:10:25,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=17, lr=[0.0001], mom=[(0.9, 0.95)]
[2024-01-17 10:10:25,081] [INFO] [timer.py:260:stop] epoch=0/micro_step=68/global_step=17, RunningAvgSamplesPerSec=7.547169766053195, CurrSamplesPerSec=6.64997792221485, MemAllocated=13.44GB, MaxMemAllocated=14.82GB
0%| | 71/114599 [00:10<4:44:42, 6.70batch/s]
Traceback (most recent call last):
  File "/root/ChatGLM-Finetuning/train.py", line 234, in
    main()
  File "/root/ChatGLM-Finetuning/train.py", line 195, in main
    model.step()
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2148, in step
    self._take_model_step(lr_kwargs)
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2054, in _take_model_step
    self.optimizer.step()
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1778, in step
    self._update_scale(self.overflow)
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2029, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
[2024-01-17 10:10:28,152] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1997
Hello, has this been resolved?
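For anyone reading the log: every step is being skipped (skipped=16 at step=16), and each overflow halves the loss scale (4 to 2, then 2 to 1) until the scaler refuses to go lower. Below is a toy sketch of that behaviour as I understand it from the log. It is not DeepSpeed's code: the class name `ToyDynamicLossScaler`, the `run` helper, and the default values are made up for illustration, and it leaves out the part where a real scaler raises the scale again after a run of clean steps.

```python
class ToyDynamicLossScaler:
    """Toy model of dynamic loss scaling. Hypothetical; not DeepSpeed's implementation."""

    def __init__(self, init_scale=4.0, min_scale=1.0, scale_factor=2.0):
        self.cur_scale = init_scale
        self.min_scale = min_scale          # assumed floor, mirrors the "minimum" in the error
        self.scale_factor = scale_factor    # assumed halving, mirrors 4 -> 2 -> 1 in the log

    def update_scale(self, overflow):
        # Clean step: nothing to do in this simplified sketch.
        if not overflow:
            return
        # Overflow while already at the floor: give up, as the traceback shows.
        if self.cur_scale <= self.min_scale:
            raise RuntimeError(
                "Current loss scale already at minimum - cannot decrease scale anymore."
            )
        # Overflow above the floor: shrink the scale and skip the step.
        self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)


def run(overflow_flags, init_scale=4.0):
    """Feed a sequence of overflow flags to the scaler and record the scale after each step."""
    scaler = ToyDynamicLossScaler(init_scale=init_scale)
    history = []
    for flag in overflow_flags:
        scaler.update_scale(flag)
        history.append(scaler.cur_scale)
    return history


if __name__ == "__main__":
    print(run([True, True]))        # scale after two overflowing steps
    try:
        run([True, True, True])     # third overflow hits the floor
    except RuntimeError as err:
        print("stopped:", err)
```

Since no step ever succeeds here, the gradients overflow in fp16 from the very first step, so lowering the scale cannot help. Things people usually check in this situation (not confirmed as the fix for this issue): whether the `fp16` section of the DeepSpeed config can be swapped for `bf16` on hardware that supports it, whether the learning rate is too high, and whether the model weights or inputs already contain inf/NaN values.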