Reimplementation training not stable

Dear @seoungwugoh, I've read your paper and found your work extremely interesting. I've been trying to reproduce it from the paper, with some minor changes such as the decoder layers. The memory read operation, which closely resembles a transformer's attention mechanism, is taken from this repo; everything else is reimplemented from the paper's description.
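For context, here is a minimal sketch of an attention-style memory read of the kind described above. The function name `memory_read`, the tensor layouts, the `1/sqrt(Ck)` scaling, and the float32 softmax are assumptions made for illustration; this is not necessarily identical to the code in the repo.

```python
import math

import torch
import torch.nn.functional as F


def memory_read(mem_key, mem_val, q_key, q_val):
    """Attention-style read from a space-time memory (illustrative sketch).

    Assumed shapes:
      mem_key: (B, Ck, T, H, W)   mem_val: (B, Cv, T, H, W)
      q_key:   (B, Ck, H, W)      q_val:   (B, Cv, H, W)
    """
    B, Ck, T, H, W = mem_key.shape
    Cv = mem_val.shape[1]

    mk = mem_key.reshape(B, Ck, T * H * W).transpose(1, 2)  # (B, THW, Ck)
    qk = q_key.reshape(B, Ck, H * W)                        # (B, Ck, HW)

    # Dot-product similarity between every memory and query location.
    # The 1/sqrt(Ck) scaling is an assumption borrowed from transformers.
    affinity = torch.bmm(mk, qk) / math.sqrt(Ck)            # (B, THW, HW)

    # Normalize over memory locations. The softmax is done in float32 as a
    # precaution, since half-precision exponentials can overflow.
    weights = F.softmax(affinity.float(), dim=1).to(mem_val.dtype)

    mv = mem_val.reshape(B, Cv, T * H * W)                  # (B, Cv, THW)
    read = torch.bmm(mv, weights).reshape(B, Cv, H, W)      # (B, Cv, H, W)

    # Concatenate the retrieved memory values with the query values.
    return torch.cat([read, q_val], dim=1), weights
```

Because the softmax is where large activations turn into overflow, it may be worth logging the maximum of `affinity` when the loss starts to diverge.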

I've been trying to train the model: the loss goes down initially, but after a while it suddenly shoots up. I've tried:

clipping the gradient norm;
lowering the learning rate;
removing the skip connections (to make sure the model actually uses the temporal information stored in the memory).
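The gradient-norm clipping listed above can be sketched as follows in plain PyTorch. The helper `training_step` and its arguments are hypothetical names, and skipping non-finite steps is an added assumption rather than something described in the paper. With Apex AMP, the Apex documentation recommends clipping `amp.master_params(optimizer)` instead of `model.parameters()`, so that the norm is measured on unscaled gradients.

```python
import torch
from torch import nn


def training_step(model, optimizer, loss_fn, inputs, targets, max_norm=1.0):
    """One optimization step with gradient-norm clipping (illustrative sketch).

    Returns (loss, grad_norm) as floats, or None in place of a value that
    was non-finite, in which case the parameter update is skipped.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)

    # Skip the step entirely if the loss is already NaN or inf.
    if not torch.isfinite(loss):
        return None, None

    loss.backward()

    # clip_grad_norm_ rescales gradients in place and returns the total
    # norm measured before clipping, which is useful to log.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    # A non-finite norm means at least one gradient overflowed.
    if not torch.isfinite(grad_norm):
        optimizer.zero_grad()
        return loss.item(), None

    optimizer.step()
    return loss.item(), grad_norm.item()
```

Logging the returned `grad_norm` over time shows whether the divergence is preceded by a gradual growth in gradient magnitude or by a single spike.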

I have not yet tried disabling batch norm as your paper suggests, and I'm using mixed-precision training with Apex AMP.
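One common reading of "disabling batch norm" is to freeze every batch-norm layer, so that its running statistics and affine parameters stop changing during training. A minimal sketch under that assumption follows; `freeze_batchnorm` is a hypothetical helper, not a function from the repo or from Apex.

```python
import torch
from torch import nn


def freeze_batchnorm(model):
    """Put all batch-norm layers in eval mode and stop training them."""
    for module in model.modules():
        # _BatchNorm is the shared base class of BatchNorm1d/2d/3d.
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.eval()  # use stored running statistics, do not update them
            for param in module.parameters():
                param.requires_grad_(False)  # freeze scale and shift
    return model
```

Calling `model.train()` switches batch-norm layers back to training mode, so the helper would need to be called again after every `model.train()`.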

Have you experienced such training instability before? What do you think could be the problem?

seoungwugoh / STM
