makrai / toread

Papers I would like to read

nmt (Gehring+ 17, Vaswani+ 17) #16

Open makrai opened 6 years ago

makrai commented 6 years ago

Variants of the LSTM-based sequence-to-sequence with attention model, notably Google Neural Machine Translation (GNMT), were superseded first by a fully convolutional sequence-to-sequence model (ConvS2S; Gehring+ 2017) and then by the Transformer (Attention Is All You Need; Vaswani+ 2017).
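The core operation shared by the attention models above, and central to the Transformer, is scaled dot-product attention. A minimal NumPy sketch (function name and toy shapes are my own, for illustration only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_queries, n_keys) similarity scores
    # Row-wise softmax, shifted by the max for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted sum of value rows

# Toy example: 2 queries attend over 3 key/value pairs of dimension 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with tiny gradients.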