2019/12/10: We have changed the model name from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).
Core Code:
Relevant links:
About the paper:
TL;DR: A simple module consistently outperforms self-attention and the Transformer model on major NMT datasets, achieving state-of-the-art performance.
We ask three questions:
We find that stand-alone self-attention has shortcomings, and we present a new module that maps the input to a hidden space and performs self-attention, convolution, and a pointwise nonlinearity in parallel. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on major NMT tasks under standard settings.
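The PyTorch sketch below is a minimal, hypothetical reading of the description above, not the repository's implementation: the input is projected into a shared hidden space and then fed to self-attention, a depth-wise convolution, and a point-wise feed-forward branch in parallel, with the branch outputs summed into a residual connection. The module name, layer sizes, kernel size, and normalization choices are assumptions.

```python
# Minimal sketch of the parallel attention / convolution / nonlinearity idea.
# Not the authors' implementation; sizes and normalization are assumptions.
import torch
import torch.nn as nn


class ParallelAttentionConvFFN(nn.Module):
    def __init__(self, embed_dim, num_heads=4, kernel_size=3):
        super().__init__()
        self.proj_in = nn.Linear(embed_dim, embed_dim)      # shared projection into the hidden space
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size,
                              padding=kernel_size // 2,
                              groups=embed_dim)             # depth-wise conv for local patterns
        self.ffn = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim),
                                 nn.ReLU(),
                                 nn.Linear(4 * embed_dim, embed_dim))
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (seq_len, batch, embed_dim), the layout expected by nn.MultiheadAttention
        h = self.proj_in(x)
        attn_out, _ = self.self_attn(h, h, h)                       # global dependencies
        conv_out = self.conv(h.permute(1, 2, 0)).permute(2, 0, 1)   # local dependencies
        ffn_out = self.ffn(h)                                       # point-wise nonlinearity
        return self.norm(x + attn_out + conv_out + ffn_out)         # residual sum of the branches


layer = ParallelAttentionConvFFN(embed_dim=512)
x = torch.randn(20, 8, 512)   # (seq_len, batch, embed_dim)
out = layer(x)                # same shape as the input: (20, 8, 512)
```

Feeding all three branches the same projected representation mirrors the shared hidden space described above; the real PRIME layer is more elaborate (the name suggests multiple convolution scales, for instance), so refer to the core code for the actual details.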
Key features:
Results:
Task | Size | Test (BLEU)
---|---|---
IWSLT14 De-En | Base | 36.3
WMT14 En-De | Large | 29.9
WMT14 En-Fr | Large | 43.5
Installing from source
To install from source and develop locally:
pip install --editable . --user
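After installing, a quick sanity check (an assumed convenience, not part of the official instructions) is to import the package and print its version string:

```python
# Verify that the editable install is importable from Python.
# For this code base the version is expected to be 0.6.2; treat the exact
# value as an assumption if your copy of the repository diverges.
import fairseq
print(fairseq.__version__)
```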
We provide pre-trained models and detailed training and evaluation instructions in examples/parallel_intersected_multi-scale_attention(Prime)/README.md.
Please cite as:
@article{zhao2019muse,
title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
journal={arXiv preprint arXiv:1911.09483},
year={2019}
}
The code is based on fairseq-0.6.2.