kweonwooj opened 6 years ago
It seems like the baseline Transformer is not well optimized. What do you think?
@chqiwang
I agree. They do not follow the model architecture of Transformer Big, and the model is only trained up to 100k steps. I suppose the reason is their hardware limitation: they use a single GTX 1080.
I like the direction of the paper. The Transformer decoder has more capacity than it needs, and I believe decoding can be made more efficient without a huge loss of performance.
Abstract
- Average Attention Network (AAN): a module that replaces self-attention in the Transformer decoder
- decoding speed improves 3~4x while preserving translation performance

Details
Introduction
- AAN fixes the decoder self-attention to be uniform average weights over previous positions (see the formula below)
- [figure: the new Transformer architecture with the AAN decoder]
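To make "uniform average weights" concrete, my reading of the average layer is that position j simply averages the decoder inputs of positions 1..j and passes the result through a feed-forward network (notation mine, roughly following the paper):

```latex
g_j = \mathrm{FFN}\left(\frac{1}{j}\sum_{k=1}^{j} y_k\right)
```

i.e., every visible position gets weight 1/j instead of a learned attention distribution, so there are no query/key projections in this sublayer.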
Average Attention Network
- AAN consists of an average layer and a gating layer with a residual connection (sketch below)
- AAN has values similar to the original Transformer's
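A minimal NumPy sketch of how I picture one AAN sublayer (average layer, then gating layer, then residual + layer norm). The function and parameter names (`ffn`, `w_gate`, ...) and the exact parameterization are my own assumptions, not the authors' released code:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # standard layer normalization over the feature dimension (no learned gain/bias for brevity)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, w1, b1, w2, b2):
    # position-wise feed-forward network used inside the average layer
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def aan_block(y, params):
    """One average-attention sublayer.

    y: (seq_len, d_model) decoder inputs.
    Replaces masked self-attention with a cumulative average,
    followed by input/forget gates and a residual connection.
    """
    seq_len, _ = y.shape

    # Average layer: position j attends uniformly (weight 1/j) to positions 1..j.
    cum_avg = np.cumsum(y, axis=0) / np.arange(1, seq_len + 1)[:, None]
    g = ffn(cum_avg, *params["ffn"])

    # Gating layer: input and forget gates decide how much of y_j vs g_j to keep.
    gates = np.concatenate([y, g], axis=-1) @ params["w_gate"] + params["b_gate"]
    i_gate, f_gate = np.split(1.0 / (1.0 + np.exp(-gates)), 2, axis=-1)
    h = i_gate * y + f_gate * g

    # Residual connection and layer normalization, as in the rest of the Transformer.
    return layer_norm(y + h)
```

The interesting part, to me, is that there are no learned attention weights at all; the only trained parameters in this sublayer are the FFN and the gate.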

Training
- Transformer big
- thumt
Decoding
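If I understand the speedup correctly, the 3~4x comes from the fact that the cumulative average can be maintained incrementally: each decoding step only updates a running sum, so the per-step cost is constant in the target length, whereas masked self-attention must attend over (and cache) all previous positions. A hedged sketch, reusing the hypothetical `ffn`/`layer_norm` helpers and parameter names from above:

```python
def aan_decode_step(y_j, running_sum, step, params):
    """One decoding step of the AAN sublayer.

    y_j:         (d_model,) embedding of the current target position.
    running_sum: sum of y_1 .. y_{j-1}; the only state carried across steps,
                 unlike self-attention, which caches all previous keys/values.
    step:        1-based position index j.
    """
    running_sum = running_sum + y_j
    g_j = ffn(running_sum / step, *params["ffn"])            # average layer, O(1) per step

    gates = np.concatenate([y_j, g_j]) @ params["w_gate"] + params["b_gate"]
    i_gate, f_gate = np.split(1.0 / (1.0 + np.exp(-gates)), 2)
    h_j = i_gate * y_j + f_gate * g_j                        # gating layer

    return layer_norm(y_j + h_j), running_sum                # output + updated state
```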
Personal Thoughts
Link : https://arxiv.org/pdf/1805.00631.pdf
Authors : Zhang et al. 2018