Closed · opened by InitialBug · closed 6 years ago
In the original paper "Attention Is All You Need", the position encoding is computed directly rather than learned. But in your code, the final tensor is wrapped in torch.nn.Parameter; is that correct?
@InitialBug You are right. I should set requires_grad=False. Thanks!
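A minimal sketch of the fix being discussed (names are illustrative, not the repo's actual code): the sinusoidal table from the paper is precomputed once, and registering it as a buffer instead of an nn.Parameter keeps it out of the optimizer so it is never updated by gradients.

```python
import torch

def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal position encoding from 'Attention Is All You Need':
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                                  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_position_encoding(max_len=50, d_model=16)
# Inside an nn.Module, register it as a non-trainable buffer rather than
# wrapping it in nn.Parameter:
#     self.register_buffer("pe", pe)
```

A buffer still moves with the module across devices via `.to()` and is saved in the state dict, but it has `requires_grad=False` and is not returned by `parameters()`, which is exactly the behavior the fix needs.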