lancopku / Prime

A simple module that consistently outperforms self-attention and the Transformer model on major NMT datasets, with state-of-the-art performance.

News

2019/12/10: We have changed the model name from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).

Introduction

Core Code:

Relevant links:

About the paper:

TL;DR: A simple module consistently outperforms self-attention and the Transformer model on major NMT datasets with state-of-the-art performance.

We ask three questions:

We find that stand-alone self-attention has shortcomings, and we present a new module that maps the input to a hidden space and performs the three operations of self-attention, convolution, and nonlinearity in parallel. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on major NMT tasks under the standard setting.
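
The block below is a minimal PyTorch sketch of this idea, not the repository's implementation: a shared projection feeds three parallel branches (multi-head self-attention, a depthwise convolution, and a position-wise nonlinearity), and their outputs are summed into a residual connection. Class, parameter, and hyper-parameter names are illustrative; the actual PRIME module differs in details such as multi-scale kernels and gating.

```python
# Illustrative sketch only: parallel self-attention + convolution + nonlinearity
# over a shared projection. Not the repository's actual implementation.
import torch
import torch.nn as nn


class ParallelAttentionConvBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, kernel_size=3, d_ffn=2048, dropout=0.1):
        super().__init__()
        # Shared projection into the hidden space used by all three branches.
        self.shared_proj = nn.Linear(d_model, d_model)
        # Branch 1: self-attention (global context).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        # Branch 2: depthwise convolution (local context; kernel_size sets the scale).
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        # Branch 3: position-wise nonlinearity (feed-forward network).
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (seq_len, batch, d_model), the layout expected by nn.MultiheadAttention.
        h = self.shared_proj(self.norm(x))
        attn_out, _ = self.self_attn(h, h, h)                       # global branch
        conv_out = self.conv(h.permute(1, 2, 0)).permute(2, 0, 1)   # local branch
        ffn_out = self.ffn(h)                                       # pointwise branch
        # The three branches run in parallel and are summed into a residual.
        return x + self.dropout(attn_out + conv_out + ffn_out)


if __name__ == "__main__":
    block = ParallelAttentionConvBlock()
    out = block(torch.randn(10, 2, 512))  # (seq_len=10, batch=2, d_model=512)
    print(out.shape)                      # torch.Size([10, 2, 512])
```

Note the single `shared_proj` feeding all three branches; as discussed in the results below, sharing this projection is what lets the convolution and self-attention branches combine effectively.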

Key features:

Results:

  1. Better than previous models on large NMT datasets; also scales to small datasets and the base-model setting.
  2. The shared projection is key to combining convolution and self-attention; it generates better long sequences and offers potential for acceleration.
| Task | Size | Test (BLEU) |
| --- | --- | --- |
| IWSLT14 De-En | Base | 36.3 |
| WMT14 En-De | Large | 29.9 |
| WMT14 En-Fr | Large | 43.5 |

Requirements and Installation

Installing from source

To install from source and develop locally:

pip install --editable . --user

We provide pre-trained models and detailed training and evaluation examples in examples/parallel_intersected_multi-scale_attention(Prime)/README.md.

Citation

Please cite as:

@article{zhao2019muse,
  title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
  author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
  journal={arXiv preprint arXiv:1911.09483},
  year={2019}
}

Notes

The code is based on fairseq-0.6.2.