graykode / nlp-tutorial

Natural Language Processing Tutorial for Deep Learning Researchers
https://www.reddit.com/r/MachineLearning/comments/amfinl/project_nlptutoral_repository_who_is_studying/
MIT License

The Adam in 5-1.Transformer should be replaced by SGD #76

Open Cheng0829 opened 1 year ago

Cheng0829 commented 1 year ago

Line 202: optimizer = optim.Adam(model.parameters(), lr=0.001)

In practice, Adam performs poorly here: at epoch 10 the cost is 1.6, and at epoch 100 or even 1000 it is still 1.6, so training stalls completely. I suggest replacing Adam with SGD, i.e. optimizer = optim.SGD(model.parameters(), lr=0.001) (see the sketch below).
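For context, here is a minimal sketch of the proposed swap inside a training loop like the tutorial's. The model, batch data, and loop shape below are stand-ins (assumptions, not the tutorial's actual 5-1.Transformer code); only the optimizer line mirrors the change being proposed.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for the tutorial's Transformer; any nn.Module is optimized the same way.
model = nn.Linear(10, 10)
criterion = nn.CrossEntropyLoss()

# Before (line 202 of 5-1.Transformer):
# optimizer = optim.Adam(model.parameters(), lr=0.001)
# Proposed replacement:
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Dummy batch standing in for the tutorial's encoder/decoder inputs and targets.
inputs = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

for epoch in range(1000):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
```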

Here are the effects of using SGD:

Epoch: 0100 cost = 0.047965
Epoch: 0200 cost = 0.020129
Epoch: 0300 cost = 0.012563
Epoch: 0400 cost = 0.009101
Epoch: 0500 cost = 0.007131
Epoch: 0600 cost = 0.005862
Epoch: 0700 cost = 0.004978
Epoch: 0800 cost = 0.004325
Epoch: 0900 cost = 0.003823
Epoch: 1000 cost = 0.003426