graykode / nlp-tutorial

Natural Language Processing Tutorial for Deep Learning Researchers
https://www.reddit.com/r/MachineLearning/comments/amfinl/project_nlptutoral_repository_who_is_studying/
MIT License

The Adam in 5-1.Transformer should be replaced by SGD #76

Open Cheng0829 opened 1 year ago

Cheng0829 commented 1 year ago

Line 202: optimizer = optim.Adam(model.parameters(), lr=0.001)

In practice, Adam performs poorly here: at epoch 10 the cost is 1.6, and at epoch 100 or even 1000 it is still 1.6, so training stalls completely. I suggest replacing Adam with SGD, i.e. optimizer = optim.SGD(model.parameters(), lr=0.001) (see the sketch below).
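For context, here is a minimal sketch of the proposed swap inside a training loop like the tutorial's. The model, batch data, and loop shape below are stand-ins (assumptions, not the tutorial's actual 5-1.Transformer code); only the optimizer line mirrors the change being proposed.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for the tutorial's Transformer; any nn.Module is optimized the same way.
model = nn.Linear(10, 10)
criterion = nn.CrossEntropyLoss()

# Before (line 202 of 5-1.Transformer):
# optimizer = optim.Adam(model.parameters(), lr=0.001)
# Proposed replacement:
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Dummy batch standing in for the tutorial's encoder/decoder inputs and targets.
inputs = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

for epoch in range(1000):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
```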

Here are the effects of using SGD:

Epoch: 0100 cost = 0.047965
Epoch: 0200 cost = 0.020129
Epoch: 0300 cost = 0.012563
Epoch: 0400 cost = 0.009101
Epoch: 0500 cost = 0.007131
Epoch: 0600 cost = 0.005862
Epoch: 0700 cost = 0.004978
Epoch: 0800 cost = 0.004325
Epoch: 0900 cost = 0.003823
Epoch: 1000 cost = 0.003426