hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0

Overflow in GPT examples #37

Open feifeibear opened 2 years ago

feifeibear commented 2 years ago

🐛 Describe the bug

I encountered overflow while running the official GPT-2 example scripts. Is that expected behavior?

cd XXX/ColossalAI/examples/language/gpt
export DATA=/data/scratch/gpt_data/small-gpt-dataset.json
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

[Epoch 0 / Train]: 0%| | 1/8614 [00:00<1:03:35, 2.26it/s, loss=265.25, lr=2.5e-05, throughput=4.5244][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[Epoch 0 / Train]: 0%| | 2/8614 [00:00<1:00:07, 2.39it/s, loss=nan, lr=2.5e-05, throughput=4.9813][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[Epoch 0 / Train]: 0%| | 3/8614 [00:01<56:35, 2.54it/s, loss=nan, lr=2.5e-05, throughput=5.4833][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[Epoch 0 / Train]: 0%| | 4/8614 [00:01<55:32, 2.58it/s, loss=nan, lr=2.5e-05, throughput=5.3257][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[Epoch 0 / Train]: 0%| | 5/8614 [00:01<54:26, 2.64it/s, loss=nan, lr=2.5e-05, throughput=5.473][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[Epoch 0 / Train]: 0%| | 6/8614 [00:02<53:34, 2.68it/s, loss=nan, lr=2.5e-05, throughput=5.5342][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[Epoch 0 / Train]: 0%| | 7/8614 [00:02<53:14, 2.69it/s, loss=nan, lr=2.5e-05, throughput=5.4624][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[Epoch 0 / Train]: 0%| | 8/8614 [00:03<52:47, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.5429][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[Epoch 0 / Train]: 0%| | 9/8614 [00:03<52:41, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.4693][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[Epoch 0 / Train]: 0%| | 10/8614 [00:03<52:14, 2.74it/s, loss=nan, lr=2.5e-05, throughput=5.6025][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[Epoch 0 / Train]: 0%| | 11/8614 [00:04<51:50, 2.77it/s, loss=nan, lr=2.5e-05, throughput=5.6395][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[Epoch 0 / Train]: 0%|▏ | 12/8614 [00:04<51:27, 2.79it/s, loss=nan, lr=2.5e-05, throughput=5.6746][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[Epoch 0 / Train]: 0%|▏ | 13/8614 [00:04<51:15, 2.80it/s, loss=nan, lr=2.5e-05, throughput=5.6452][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[Epoch 0 / Train]: 0%|▏ | 14/8614 [00:05<50:58, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.7043][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[Epoch 0 / Train]: 0%|▏ | 15/8614 [00:05<50:56, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.6454][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[Epoch 0 / Train]: 0%|▏ | 16/8614 [00:05<50:48, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.678][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[Epoch 0 / Train]: 0%|▏ | 17/8614 [00:06<50:38, 2.83it/s, loss=nan, lr=2.5e-05, throughput=5.7112][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
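For context, the repeated "OVERFLOW! ... reducing to ..." messages are the dynamic loss scaler at work: the scale starts at 2^32 = 4294967296 and is halved each time non-finite gradients are detected, with the optimizer step skipped until a usable scale is reached. The sketch below is only an illustration of that general mechanism, not ColossalAI's actual scaler; the class name, arguments, and thresholds are made up for the example.

import torch

class DynamicLossScaler:
    # Illustrative dynamic loss scaling, hypothetical API.
    def __init__(self, init_scale=2**32, growth_interval=1000):
        self.scale = init_scale              # matches the log's starting value 4294967296
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss before backward so small fp16 gradients do not underflow.
        return loss * self.scale

    def step(self, optimizer, parameters):
        # If any gradient is inf/nan, skip the update and halve the scale,
        # which is the pattern shown in the log above.
        overflow = any(
            p.grad is not None and not torch.isfinite(p.grad).all()
            for p in parameters
        )
        if overflow:
            self.scale /= 2
            self._good_steps = 0
            optimizer.zero_grad()
            return False
        # Otherwise unscale the gradients and apply the real update.
        for p in parameters:
            if p.grad is not None:
                p.grad.div_(self.scale)
        optimizer.step()
        optimizer.zero_grad()
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= 2                  # cautiously grow the scale back
        return True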

Environment

ffmpeg        4.3     hf484d3e_0                   pytorch
pytorch       1.10.2  py3.9_cuda11.3_cudnn8.2.0_0  pytorch
pytorch-mutex 1.0     cuda                         pytorch
torchaudio    0.10.2  py39_cu113                   pytorch
torchvision   0.11.3  py39_cu113                   pytorch

Gy-Lu commented 2 years ago

Hi, overflow is expected when running GPT-2 on a single GPU. We recommend running it with more than 4 GPUs.
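For example, the same launch command on 4 GPUs only changes the torchrun --nproc_per_node argument (assuming a single node with 4 GPUs available):

torchrun --standalone --nproc_per_node=4 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch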