Open muziyongshixin opened 3 years ago
I followed the instructions in the README to download all the metadata and used the command below to start training.
```shell
python sample.py --is_trainging=1
```
After training for roughly tens of thousands of batches, NaN errors appear: both the accuracy and the loss become NaN. I have tried this several times and the problem persists. Does anyone know the reason?
Below is part of the training log file:

```
2020-12-19 19:32:42,559-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73200/410622 Progress: 1%, Train Total: -10.6086, Perplexity: -9.8591, Bbox Loss: -0.7495
2020-12-19 19:32:44,964-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73250/410622 Progress: 1%, Train Total: -10.5154, Perplexity: -10.0869, Bbox Loss: -0.4285
2020-12-19 19:32:47,430-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73300/410622 Progress: 1%, Train Total: -10.2433, Perplexity: -9.7161, Bbox Loss: -0.5272
2020-12-19 19:32:50,007-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73350/410622 Progress: 1%, Train Total: -9.8285, Perplexity: -9.4005, Bbox Loss: -0.4280
2020-12-19 19:32:52,791-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73400/410622 Progress: 1%, Train Total: nan, Perplexity: nan, Bbox Loss: nan
2020-12-19 19:32:55,854-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73450/410622 Progress: 1%, Train Total: nan, Perplexity: nan, Bbox Loss: nan
2020-12-19 19:32:59,942-[supervised_trainer.py:175-_train_epoches()]-INFO-| 73500/410622 Progress: 1%, Train Total: nan, Perplexity: nan, Bbox Loss: nan
```
I solved this problem by setting Adam's learning rate to 1e-4. The NaN error seems to be caused by a learning rate that is too large for the gradient backpropagation step.
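For reference, a minimal sketch of the fix above, assuming the project trains with PyTorch (the tiny linear model here is just a stand-in for the real network, and the gradient-clipping line is a common companion safeguard, not part of the original fix):

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the real seq2seq model
# Adam's default learning rate is 1e-3; lowering it to 1e-4 avoided the NaN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()
loss.backward()
# Clipping the gradient norm is another common safeguard against NaN losses
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```

If the NaN still appears at 1e-4, clipping plus an even smaller learning rate is the usual next thing to try.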
Thanks for sharing your experience. I encountered NaN after about 158k steps:
```
44,800 seq2seq.trainer.supervised_trainer INFO 157750/410622 Progress: 3%, Train Total: -13.4886, Perplexity: -12.4912, Bbox Loss: -0.9974 l_accuracy: 0.42857142857142855 step: 157800 total step: 4106220
2022-03-17 11:16:47,819 seq2seq.trainer.supervised_trainer INFO 157800/410622 Progress: 3%, Train Total: -14.1226, Perplexity: -13.0229, Bbox Loss: -1.0998 l_accuracy: 0.5 step: 157850 total step: 4106220
2022-03-17 11:16:52,111 seq2seq.trainer.supervised_trainer INFO 157850/410622 Progress: 3%, Train Total: -12.7759, Perplexity: -11.7502, Bbox Loss: -1.0257 l_accuracy: 0.5 step: 157900 total step: 4106220
2022-03-17 11:16:55,737 seq2seq.trainer.supervised_trainer INFO 157900/410622 Progress: 3%, Train Total: -14.2770, Perplexity: -13.5714, Bbox Loss: -0.7056 l_accuracy: 0.5 step: 157950 total step: 4106220
2022-03-17 11:16:59,125 seq2seq.trainer.supervised_trainer INFO 157950/410622 Progress: 3%, Train Total: -14.6622, Perplexity: -13.4238, Bbox Loss: -1.2384 l_accuracy: 0.6666666666666666 step: 158000 total step: 4106220
2022-03-17 11:17:02,468 seq2seq.trainer.supervised_trainer INFO 158000/410622 Progress: 3%, Train Total: -12.4783, Perplexity: -11.8630, Bbox Loss: -0.6154 l_accuracy: 0.3333333333333333 step: 158050 total step: 4106220
2022-03-17 11:17:05,791 seq2seq.trainer.supervised_trainer INFO 158050/410622 Progress: 3%, Train Total: -10.3930, Perplexity: -9.7487, Bbox Loss: -0.6442 l_accuracy: 0.14285714285714285 step: 158100 total step: 4106220
2022-03-17 11:17:09,740 seq2seq.trainer.supervised_trainer INFO 158100/410622 Progress: 3%, Train Total: -11.3649, Perplexity: -11.0917, Bbox Loss: -0.2732 l_accuracy: 0.875 step: 158150 total step: 4106220
2022-03-17 11:17:13,338 seq2seq.trainer.supervised_trainer INFO 158150/410622 Progress: 3%, Train Total: -13.4249, Perplexity: -12.4471, Bbox Loss: -0.9778 l_accuracy: 0.5454545454545454 step: 158200 total step: 4106220
2022-03-17 11:17:16,410 seq2seq.trainer.supervised_trainer INFO 158200/410622 Progress: 3%, Train Total: -13.5133, Perplexity: -12.6634, Bbox Loss: -0.8499 l_accuracy: 0.5 step: 158250 total step: 4106220
2022-03-17 11:17:19,989 seq2seq.trainer.supervised_trainer INFO 158250/410622 Progress: 3%, Train Total: -13.7298, Perplexity: -13.7187, Bbox Loss: -0.0110 l_accuracy: 0.0 step: 158300 total step: 4106220
2022-03-17 11:17:23,242 seq2seq.trainer.supervised_trainer INFO 158300/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158350 total step: 4106220
2022-03-17 11:17:27,110 seq2seq.trainer.supervised_trainer INFO 158350/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158400 total step: 4106220
2022-03-17 11:17:30,312 seq2seq.trainer.supervised_trainer INFO 158400/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158450 total step: 4106220
2022-03-17 11:17:33,319 seq2seq.trainer.supervised_trainer INFO 158450/410622 Progress: 3%, Train Total: nan, Perplexity: nan, Bbox Loss: nan l_accuracy: 0.0 step: 158500 total step: 4106220
2022-03-17 11:17:36,220 seq2seq.trainer.supervised_trainer INFO 158500/410622 Progress: 3%, Train Total: nan,
```
I will see whether decaying the learning rate helps.
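In case it helps anyone, a minimal sketch of learning-rate decay using PyTorch's built-in scheduler; the step size and decay factor below are arbitrary illustrative choices, not values from this repo:

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Multiply the learning rate by 0.5 every 50,000 optimizer steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.5)

# In the training loop, call scheduler.step() after each optimizer.step():
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
```

Since the NaN showed up around step 158k here, a decay schedule that has already shrunk the rate a few times by then might prevent the blow-up without slowing early training.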