Makes sense, thanks for that. Can you please run the copy-task notebook and confirm we're getting the same results?
I trained a bunch of fairly long models and got good results in the notebooks.
Like this one, which was trained for a while on sequences of length up to 120, and converges very sharply.
I tested it as well; it seems to alter convergence a bit, but perhaps for the better.
Removed the softplus applied before the softmax:
the softmax already constrains the values to (0, 1), so the softplus doesn't achieve anything. PyTorch's softmax implementation is already numerically stable, so that isn't a concern either.
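For illustration, here's a minimal sketch of the change, assuming the softplus was applied to the raw attention scores immediately before the softmax (the tensor names and shapes below are hypothetical, not from the actual code):

```python
import torch
import torch.nn.functional as F

# Hypothetical attention scores; large magnitudes to probe stability.
scores = torch.randn(2, 5) * 50

# Before: softplus applied to the scores, then softmax.
old_weights = F.softmax(F.softplus(scores), dim=-1)

# After: plain softmax. It already maps the scores into (0, 1) with each
# row summing to 1, and PyTorch's implementation subtracts the row max
# internally, so it stays numerically stable even for large inputs.
new_weights = F.softmax(scores, dim=-1)

print(new_weights.sum(dim=-1))           # each row sums to 1
print(torch.isfinite(new_weights).all()) # no overflow or NaN
```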