kweonwooj opened 6 years ago
It seems like the baseline Transformer is not well optimized. What do you think?
@chqiwang
I agree. They do not follow the model architecture of Transformer Big, and the model is only trained up to 100k steps. I suppose the reason is their hardware limitation: they use a single GTX 1080.
I like the direction of the paper. The Transformer decoder has more capacity than it needs, and I believe decoding can be made more efficient without a huge loss of performance.
Abstract
- Average Attention Network (AAN): a module that replaces self-attention in the Transformer decoder
- decoding speed improves 3~4x while preserving translation performance

Details
Introduction
- AAN fixes the decoder self-attention to be uniform average weights over previous positions (see the formula below)
- [figure: the new Transformer architecture with the AAN decoder]
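To make "uniform average weights" concrete, my reading of the average layer is that position j simply averages the decoder inputs of positions 1..j and passes the result through a feed-forward network (notation mine, roughly following the paper):

```latex
g_j = \mathrm{FFN}\left(\frac{1}{j}\sum_{k=1}^{j} y_k\right)
```

i.e., every visible position gets weight 1/j instead of a learned attention distribution, so there are no query/key projections in this sublayer.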
Average Attention Network
- AAN consists of an average layer and a gating layer with a residual connection (sketch below)
- AAN has values similar to the original Transformer's
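A minimal NumPy sketch of how I picture one AAN sublayer (average layer, then gating layer, then residual + layer norm). The function and parameter names (`ffn`, `w_gate`, ...) and the exact parameterization are my own assumptions, not the authors' released code:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # standard layer normalization over the feature dimension (no learned gain/bias for brevity)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, w1, b1, w2, b2):
    # position-wise feed-forward network used inside the average layer
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def aan_block(y, params):
    """One average-attention sublayer.

    y: (seq_len, d_model) decoder inputs.
    Replaces masked self-attention with a cumulative average,
    followed by input/forget gates and a residual connection.
    """
    seq_len, _ = y.shape

    # Average layer: position j attends uniformly (weight 1/j) to positions 1..j.
    cum_avg = np.cumsum(y, axis=0) / np.arange(1, seq_len + 1)[:, None]
    g = ffn(cum_avg, *params["ffn"])

    # Gating layer: input and forget gates decide how much of y_j vs g_j to keep.
    gates = np.concatenate([y, g], axis=-1) @ params["w_gate"] + params["b_gate"]
    i_gate, f_gate = np.split(1.0 / (1.0 + np.exp(-gates)), 2, axis=-1)
    h = i_gate * y + f_gate * g

    # Residual connection and layer normalization, as in the rest of the Transformer.
    return layer_norm(y + h)
```

The interesting part, to me, is that there are no learned attention weights at all; the only trained parameters in this sublayer are the FFN and the gate.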

Training
- Transformer big
- thumt
Decoding
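If I understand the speedup correctly, the 3~4x comes from the fact that the cumulative average can be maintained incrementally: each decoding step only updates a running sum, so the per-step cost is constant in the target length, whereas masked self-attention must attend over (and cache) all previous positions. A hedged sketch, reusing the hypothetical `ffn`/`layer_norm` helpers and parameter names from above:

```python
def aan_decode_step(y_j, running_sum, step, params):
    """One decoding step of the AAN sublayer.

    y_j:         (d_model,) embedding of the current target position.
    running_sum: sum of y_1 .. y_{j-1}; the only state carried across steps,
                 unlike self-attention, which caches all previous keys/values.
    step:        1-based position index j.
    """
    running_sum = running_sum + y_j
    g_j = ffn(running_sum / step, *params["ffn"])            # average layer, O(1) per step

    gates = np.concatenate([y_j, g_j]) @ params["w_gate"] + params["b_gate"]
    i_gate, f_gate = np.split(1.0 / (1.0 + np.exp(-gates)), 2)
    h_j = i_gate * y_j + f_gate * g_j                        # gating layer

    return layer_norm(y_j + h_j), running_sum                # output + updated state
```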
Personal Thoughts
Link : https://arxiv.org/pdf/1805.00631.pdf
Authors : Zhang et al. 2018