Are you using gradient clipping?
I am not sure which version you are using, but I had NaNs when I didn't divide the cost by the batch size. I don't know why this happens, though (it shouldn't).
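For reference, this is roughly the kind of change I mean; a minimal Theano sketch, where `per_example_cost` is a stand-in for whatever per-sentence cost the model actually computes:

```python
import theano.tensor as tensor

# Stand-in for the per-sentence costs, shape (batch_size,).
per_example_cost = tensor.vector('per_example_cost')

# Summing makes the gradient magnitude grow with the batch size,
# which can push updates past the clipping threshold:
summed_cost = per_example_cost.sum()

# Taking the mean keeps the gradient scale independent of batch size:
mean_cost = per_example_cost.mean()
```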
I can reproduce the problem with https://github.com/kyunghyuncho/NMT/commit/41d3e6a020f49733c8184440939fd2bfbff62f91, https://github.com/kyunghyuncho/NMT/commit/6425c296d367b324495781e68ffb2e2a6e574d1c, and https://github.com/kyunghyuncho/NMT/commit/e0d24ff5ee3d8cf66b3bda742d8ca67ff7fccc3c.
I'm using orhanf/blocks/wmt15 (with https://github.com/bartvm/blocks/commit/66fcc397ec3a85d021329cac1fe1619322875888 added).
Maybe I have a problem with something else. @orhanf, were you able to train without getting NaN?
@kyunghyuncho yes, I'm simply running the code unchanged (it has step clipping set to 10).
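For context, the setup looks roughly like this; a minimal sketch, where `cost` and `parameters` stand in for the model's own objects (and note that older Blocks versions call the keyword `params` instead of `parameters`):

```python
from blocks.algorithms import (CompositeRule, GradientDescent,
                               Scale, StepClipping)

# Rescale any update whose L2 norm exceeds 10, then apply the learning rate.
algorithm = GradientDescent(
    cost=cost, parameters=parameters,
    step_rule=CompositeRule([StepClipping(threshold=10),
                             Scale(learning_rate=0.01)]))
```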
@ejls I am running the models with the state extensions as _TEST; I haven't run a full model.
In GroundHog, there are these lines which remove the NaNs from the gradient.
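Roughly, the trick is to detect non-finite values in the gradient and replace them before the update is applied; a minimal Theano sketch of the idea (not GroundHog's exact code, and the 0.1 scaler is illustrative):

```python
import numpy
import theano.tensor as tensor

def remove_not_finite(grad, param, scaler=0.1):
    """If the gradient contains any NaN or inf, replace it by a small
    multiple of the parameter so that the update stays finite."""
    not_finite = tensor.or_(tensor.isnan(grad), tensor.isinf(grad)).any()
    return tensor.switch(not_finite, numpy.float32(scaler) * param, grad)
```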
So when I do the same in the Blocks implementation by adding blocks.algorithms.RemoveNotFinite as the first rule here, the model trains correctly, except that the cost is inf for 3 iterations at the beginning of training (this might go away after reshuffling the training set). Another solution is to scale the gradient down: when adding blocks.algorithms.Scale(0.5), I don't get any NaN/inf.
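Concretely, the first solution amounts to something like this; again a minimal sketch assuming the standard Blocks step rules, with `cost` and `parameters` as placeholders:

```python
from blocks.algorithms import (CompositeRule, GradientDescent,
                               RemoveNotFinite, Scale, StepClipping)

# RemoveNotFinite goes first so non-finite gradient entries are replaced
# before step clipping and the learning rate are applied.
algorithm = GradientDescent(
    cost=cost, parameters=parameters,
    step_rule=CompositeRule([RemoveNotFinite(),
                             StepClipping(threshold=10),
                             Scale(learning_rate=0.01)]))
```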
@ejls I guess it means that the norm of the gradient is still too large, doesn't it? Can you make a PR with the first solution to NMT?
https://github.com/bartvm/blocks/commit/66fcc397ec3a85d021329cac1fe1619322875888 makes the cost reach NaN after ~8 iterations on the Finnish -> English WMT15 dataset.