lancopku / Prime

A simple module that consistently outperforms self-attention and the Transformer model on major NMT datasets, with state-of-the-art performance.

News

2019/12/10: We have changed the model name from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).

Introduction

Core Code:

Relevant links:

About the paper:

TL;DR: A simple module consistently outperforms self-attention and the Transformer model on major NMT datasets with state-of-the-art performance.

We ask three questions:

We find that stand-alone self-attention has shortcomings, and we present a new module that maps the input to a hidden space and performs the three operations of self-attention, convolution, and nonlinearity in parallel. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on major NMT tasks under the standard setting.
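
The block below is a minimal PyTorch sketch of this idea, not the repository's implementation: a shared projection feeds three parallel branches (multi-head self-attention, a depthwise convolution, and a position-wise nonlinearity), and their outputs are summed into a residual connection. Class, parameter, and hyper-parameter names are illustrative; the actual PRIME module differs in details such as multi-scale kernels and gating.

```python
# Illustrative sketch only: parallel self-attention + convolution + nonlinearity
# over a shared projection. Not the repository's actual implementation.
import torch
import torch.nn as nn


class ParallelAttentionConvBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, kernel_size=3, d_ffn=2048, dropout=0.1):
        super().__init__()
        # Shared projection into the hidden space used by all three branches.
        self.shared_proj = nn.Linear(d_model, d_model)
        # Branch 1: self-attention (global context).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        # Branch 2: depthwise convolution (local context; kernel_size sets the scale).
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        # Branch 3: position-wise nonlinearity (feed-forward network).
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (seq_len, batch, d_model), the layout expected by nn.MultiheadAttention.
        h = self.shared_proj(self.norm(x))
        attn_out, _ = self.self_attn(h, h, h)                       # global branch
        conv_out = self.conv(h.permute(1, 2, 0)).permute(2, 0, 1)   # local branch
        ffn_out = self.ffn(h)                                       # pointwise branch
        # The three branches run in parallel and are summed into a residual.
        return x + self.dropout(attn_out + conv_out + ffn_out)


if __name__ == "__main__":
    block = ParallelAttentionConvBlock()
    out = block(torch.randn(10, 2, 512))  # (seq_len=10, batch=2, d_model=512)
    print(out.shape)                      # torch.Size([10, 2, 512])
```

Note the single `shared_proj` feeding all three branches; as discussed in the results below, sharing this projection is what lets the convolution and self-attention branches combine effectively.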

Key features:

Results:

  1. Better than previous models on large NMT datasets; also scales to small datasets and the base-model setting.
  2. The shared projection is key to combining convolution and self-attention; it generates better long sequences and offers potential for acceleration.
| Task | Size | Test (BLEU) |
| --- | --- | --- |
| IWSLT14 De-En | Base | 36.3 |
| WMT14 En-De | Large | 29.9 |
| WMT14 En-Fr | Large | 43.5 |

Requirements and Installation

Installing from source

To install from source and develop locally:

pip install --editable . --user

We provide pre-trained models and detailed training and evaluation examples in examples/parallel_intersected_multi-scale_attention(Prime)/README.md.

Citation

Please cite as:

@article{zhao2019muse,
  title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
  author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
  journal={arXiv preprint arXiv:1911.09483},
  year={2019}
}

Notes

The code is based on fairseq-0.6.2.