Open muziyongshixin opened 3 years ago
I followed the instructions in the README to download all the metadata and used the command below to start training.
```shell
python sample.py --is_trainging=1
```
After training for roughly tens of thousands of batches, NaN errors appear: both the accuracy and the loss become NaN. I have tried this several times and the problem persists. Does anyone know the reason?
Below is part of the training log file:

```
2020-12-19 19:32:42,559-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73200/410622 Progress: 1%, Train Total: -10.6086, Perplexity: -9.8591, Bbox Loss: -0.7495
2020-12-19 19:32:44,964-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73250/410622 Progress: 1%, Train Total: -10.5154, Perplexity: -10.0869, Bbox Loss: -0.4285
2020-12-19 19:32:47,430-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73300/410622 Progress: 1%, Train Total: -10.2433, Perplexity: -9.7161, Bbox Loss: -0.5272
2020-12-19 19:32:50,007-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73350/410622 Progress: 1%, Train Total: -9.8285, Perplexity: -9.4005, Bbox Loss: -0.4280
2020-12-19 19:32:52,791-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73400/410622 Progress: 1%, Train Total: nan, Perplexity: nan, Bbox Loss: nan
2020-12-19 19:32:55,854-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73450/410622 Progress: 1%, Train Total: nan, Perplexity: nan, Bbox Loss: nan
2020-12-19 19:32:59,942-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73500/410622 Progress: 1%, Train Total: nan, Perplexity: nan, Bbox Loss: nan
```
I solved this problem by setting Adam's learning rate to 1e-4. The NaN error seems to be caused by a learning rate that is too large for the gradient backpropagation step.
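For reference, a minimal sketch of the fix above, assuming the project trains with PyTorch (the tiny linear model here is just a stand-in for the real network, and the gradient-clipping line is a common companion safeguard, not part of the original fix):

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the real seq2seq model
# Adam's default learning rate is 1e-3; lowering it to 1e-4 avoided the NaN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()
loss.backward()
# Clipping the gradient norm is another common safeguard against NaN losses
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```

If the NaN still appears at 1e-4, clipping plus an even smaller learning rate is the usual next thing to try.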
Thanks for sharing your experience. I encountered NaN after about 158k steps:
```
44,800 seq2seq.trainer.supervised_trainer INFO 157750/410622 Progress: 3%, Train Total: -13.4886, Perplexity: -12.4912, Bbox Loss: -0.9974 l_accuracy: 0.42857142857142855 step: 157800 total step: 4106220
2022-03-17 11:16:47,819 seq2seq.trainer.supervised_trainer INFO 157800/410622 Progress: 3%, Train Total: -14.1226, Perplexity: -13.0229, Bbox Loss: -1.0998 l_accuracy: 0.5 step: 157850 total step: 4106220
2022-03-17 11:16:52,111 seq2seq.trainer.supervised_trainer INFO 157850/410622 Progress: 3%, Train Total: -12.7759, Perplexity: -11.7502, Bbox Loss: -1.0257 l_accuracy: 0.5 step: 157900 total step: 4106220
2022-03-17 11:16:55,737 seq2seq.trainer.supervised_trainer INFO 157900/410622 Progress: 3%, Train Total: -14.2770, Perplexity: -13.5714, Bbox Loss: -0.7056 l_accuracy: 0.5 step: 157950 total step: 4106220
2022-03-17 11:16:59,125 seq2seq.trainer.supervised_trainer INFO 157950/410622 Progress: 3%, Train Total: -14.6622, Perplexity: -13.4238, Bbox Loss: -1.2384 l_accuracy: 0.6666666666666666 step: 158000 total step: 4106220
2022-03-17 11:17:02,468 seq2seq.trainer.supervised_trainer INFO 158000/410622 Progress: 3%, Train Total: -12.4783, Perplexity: -11.8630, Bbox Loss: -0.6154 l_accuracy: 0.3333333333333333 step: 158050 total step: 4106220
2022-03-17 11:17:05,791 seq2seq.trainer.supervised_trainer INFO 158050/410622 Progress: 3%, Train Total: -10.3930, Perplexity: -9.7487, Bbox Loss: -0.6442 l_accuracy: 0.14285714285714285 step: 158100 total step: 4106220
2022-03-17 11:17:09,740 seq2seq.trainer.supervised_trainer INFO 158100/410622 Progress: 3%, Train Total: -11.3649, Perplexity: -11.0917, Bbox Loss: -0.2732 l_accuracy: 0.875 step: 158150 total step: 4106220
2022-03-17 11:17:13,338 seq2seq.trainer.supervised_trainer INFO 158150/410622 Progress: 3%, Train Total: -13.4249, Perplexity: -12.4471, Bbox Loss: -0.9778 l_accuracy: 0.5454545454545454 step: 158200 total step: 4106220
2022-03-17 11:17:16,410 seq2seq.trainer.supervised_trainer INFO 158200/410622 Progress: 3%, Train Total: -13.5133, Perplexity: -12.6634, Bbox Loss: -0.8499 l_accuracy: 0.5 step: 158250 total step: 4106220
2022-03-17 11:17:19,989 seq2seq.trainer.supervised_trainer INFO 158250/410622 Progress: 3%, Train Total: -13.7298, Perplexity: -13.7187, Bbox Loss: -0.0110 l_accuracy: 0.0 step: 158300 total step: 4106220
2022-03-17 11:17:23,242 seq2seq.trainer.supervised_trainer INFO 158300/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158350 total step: 4106220
2022-03-17 11:17:27,110 seq2seq.trainer.supervised_trainer INFO 158350/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158400 total step: 4106220
2022-03-17 11:17:30,312 seq2seq.trainer.supervised_trainer INFO 158400/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158450 total step: 4106220
2022-03-17 11:17:33,319 seq2seq.trainer.supervised_trainer INFO 158450/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158500 total step: 4106220
2022-03-17 11:17:36,220 seq2seq.trainer.supervised_trainer INFO 158500/410622 Progress: 3%, Train Total: nan,
```
I will see whether decaying the learning rate helps.
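In case it helps anyone, a minimal sketch of learning-rate decay using PyTorch's built-in scheduler; the step size and decay factor below are arbitrary illustrative choices, not values from this repo:

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Multiply the learning rate by 0.5 every 50,000 optimizer steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.5)

# In the training loop, call scheduler.step() after each optimizer.step():
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
```

Since the NaN showed up around step 158k here, a decay schedule that has already shrunk the rate a few times by then might prevent the blow-up without slowing early training.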