kyunghyuncho / NMT


NaN cost when training with the new AdaDelta parameters #22

Closed ejls closed 9 years ago

ejls commented 9 years ago

https://github.com/bartvm/blocks/commit/66fcc397ec3a85d021329cac1fe1619322875888 makes the cost reach NaN after ~8 iterations on the Finnish -> English WMT15 dataset.
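For reference, the plain AdaDelta update (Zeiler, 2012) can be sketched in NumPy as below; `rho` and `eps` are the two tunable parameters (the specific values changed in the commit above are not reproduced here, and this is a sketch, not the blocks implementation):

```python
import numpy as np

def adadelta_step(param, grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    """One AdaDelta update. Returns the new parameter and the two
    running accumulators (mean squared gradient, mean squared delta)."""
    acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2
    delta = -np.sqrt(acc_delta + eps) / np.sqrt(acc_grad + eps) * grad
    acc_delta = rho * acc_delta + (1.0 - rho) * delta ** 2
    return param + delta, acc_grad, acc_delta
```

A very small `eps` combined with tiny accumulators makes the ratio of square roots sensitive early in training, which is one place where parameter changes can tip the cost into NaN.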

kyunghyuncho commented 9 years ago

Are you using gradient clipping?

sebastien-j commented 9 years ago

I am not sure which version you are using, but I had NaNs when I didn't divide the cost by the batch size. I don't know why this happens, though (it shouldn't).
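Summing instead of averaging over the batch scales the gradient by the batch size, which can push otherwise-fine updates toward overflow. A toy NumPy sketch with a hypothetical linear least-squares cost (not the NMT cost itself):

```python
import numpy as np

rng = np.random.RandomState(0)
batch = rng.randn(64, 10)   # 64 examples, 10 features
w = np.ones(10)

# Gradient of the per-batch *summed* squared error ||X w||^2 w.r.t. w,
# versus the gradient of the *mean* cost: they differ exactly by the
# batch size, so an unnormalised cost inflates every gradient 64x here.
grad_sum = 2.0 * batch.T.dot(batch.dot(w))
grad_mean = grad_sum / batch.shape[0]
```

The same factor applies to the gradient norm, so whether a fixed clipping threshold like 10 actually fires depends on this normalisation choice.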

ejls commented 9 years ago

I can reproduce the problem with https://github.com/kyunghyuncho/NMT/commit/41d3e6a020f49733c8184440939fd2bfbff62f91, https://github.com/kyunghyuncho/NMT/commit/6425c296d367b324495781e68ffb2e2a6e574d1c and https://github.com/kyunghyuncho/NMT/commit/e0d24ff5ee3d8cf66b3bda742d8ca67ff7fccc3c .

I'm using orhanf/blocks/wmt15 (with https://github.com/bartvm/blocks/commit/66fcc397ec3a85d021329cac1fe1619322875888 added)

Maybe I have a problem with something else. @orhanf, were you able to train without getting NaNs?

ejls commented 9 years ago

@kyunghyuncho yes, I'm simply running the code unchanged (it has step clipping set at 10).
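Step clipping at a fixed threshold rescales all gradients jointly by their global norm. A minimal NumPy sketch of that behaviour (an assumption about what the step-clipping rule does, not the actual blocks code). Note the catch relevant to this thread: a NaN gradient slips straight through, because `NaN > threshold` evaluates to false:

```python
import numpy as np

def clip_steps(grads, threshold=10.0):
    """Global-norm clipping: if the joint L2 norm of all gradient
    arrays exceeds `threshold`, rescale each by threshold / norm."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:  # False when norm is NaN, so NaNs pass through
        grads = [g * (threshold / norm) for g in grads]
    return grads
```

So clipping at 10 bounds large-but-finite steps, but it cannot recover from a gradient that is already NaN/Inf.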

orhanf commented 9 years ago

@ejls i am running the models with state extensions as _TEST, haven't run a full model

ejls commented 9 years ago

In GroundHog, there are these lines, which remove NaNs from the gradient.

So when I do the same in the blocks implementation by adding blocks.algorithms.RemoveNotFinite as the first rule here, the model trains correctly, except that the cost is inf for 3 iterations at the beginning of training (this might go away after reshuffling the training set).

Another solution is to scale the gradient down: when adding blocks.algorithms.Scale(0.5), I don't get any NaN/inf.
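The two workarounds can be sketched as simple step transformations. This is a simplified guess at what the RemoveNotFinite and Scale rules do, not their actual blocks implementations (the real RemoveNotFinite may substitute something other than zero):

```python
import numpy as np

def remove_not_finite(step):
    # Zero out NaN/Inf entries; finite entries pass through unchanged.
    return np.where(np.isfinite(step), step, 0.0)

def scale(step, factor=0.5):
    # Plain multiplicative scaling, like a Scale(0.5) step rule.
    return factor * step

# Apply RemoveNotFinite first, as in the fix described above, then scale.
step = np.array([0.2, np.nan, -np.inf, 1.0])
safe = scale(remove_not_finite(step))
```

Ordering matters: if the non-finite filter runs first, later rules in the chain (scaling, clipping) only ever see finite values.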

kyunghyuncho commented 9 years ago

@ejls

I guess this means that the norm of the gradient is still too large, doesn't it? Can you make a PR to NMT with the first solution?