Abstract
Directional Self-Attention Network (DiSAN)

Details
Introduction
RNN : LSTM, GRU, a natural choice for sequential tasks
CNN : hierarchical structure that captures local or position-invariant features
Recently, a fully attentional seq2seq model, the Transformer, was proposed
The contributions of this paper are:
propose multi-dimensional attention
propose directional attention
propose DiSAN, built from the above blocks, and show SoTA performance
Multi-Dimensional Attention
Traditional alignment scoring outputs a single scalar per token, which acts as that token's weight when producing a context vector
Multi-dimensional attention instead produces a score vector per token, which weights each feature of the token separately when producing the context vector (see the sketch below)
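A minimal NumPy sketch of the difference (my own illustration, not the paper's code): scalar attention normalizes one weight per token, while multi-dimensional attention normalizes a weight per token per feature.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))          # n token embeddings with d features each

# Scalar attention: one alignment score per token -> one weight per token.
scores = rng.normal(size=(n,))       # stand-in for any alignment function f(x_i)
p = softmax(scores, axis=0)          # (n,)
context = (p[:, None] * x).sum(axis=0)       # (d,) context vector

# Multi-dimensional attention: a score *vector* per token -> a weight per
# token and per feature, normalized across tokens for each feature separately.
scores_md = rng.normal(size=(n, d))  # stand-in for a vector-valued f(x_i)
P = softmax(scores_md, axis=0)       # (n, d); each column sums to 1
context_md = (P * x).sum(axis=0)     # (d,) context vector
```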
Two types of multi-dimensional self-attention (sketched after this list):
token2token : computes self-attention between every pair of tokens in a sentence
source2token : computes self-attention of each token with respect to the whole sentence; W learns a single weight shared across all tokens in the sentence
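A rough sketch of the two variants using an additive scoring function; the parameter names (W1, W2, W, b1) and random initialization are my own stand-ins for the paper's learned weights.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8
rng = np.random.default_rng(1)
x = rng.normal(size=(n, d))                      # token embeddings
W1, W2, W = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
b1 = np.zeros(d)

# token2token: a d-dimensional score vector for every ordered pair (i, j).
S = np.tanh((x @ W1)[:, None, :] + (x @ W2)[None, :, :] + b1)   # (n, n, d)
P = softmax(S @ W, axis=1)                       # weights over attended tokens j
s = np.einsum('ijd,jd->id', P, x)                # (n, d): one output per token

# source2token: the score depends on each token alone; W1 (and the other
# parameters) are shared across all tokens in the sentence.
Q = softmax(np.tanh(x @ W1 + b1) @ W, axis=0)    # (n, d) weights over tokens
sentence = (Q * x).sum(axis=0)                   # (d,): one vector per sentence
```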
Directional Self-Attention Network
uses the blocks proposed above to build DiSAN
Architecture of the DiSA block (a minimal sketch follows):
word embedding -> FCN -> token2token self-attention with a positional mask (forward, backward, or diag-disabled) -> fusion gate, similar to an LSTM gate, with a residual input
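A rough end-to-end sketch of one DiSA block under the shapes above; parameter names are my own, biases are omitted, and tanh stands in for the paper's ELU activation.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def positional_mask(n, kind):
    # Paper's conventions: forward mask is 0 where i < j, backward where i > j,
    # diag-disabled everywhere except i == j; blocked pairs get -inf (~ -1e9).
    i, j = np.indices((n, n))
    allowed = {'forward': i < j, 'backward': i > j, 'diag-disabled': i != j}[kind]
    return np.where(allowed, 0.0, -1e9)

def disa_block(x, params, mask='forward'):
    W_h, W1, W2, W, Wf1, Wf2 = params            # all (d, d); biases omitted
    n, d = x.shape
    h = np.tanh(x @ W_h)                         # FCN over the word embeddings
    S = np.tanh((h @ W1)[:, None, :] + (h @ W2)[None, :, :])    # (n, n, d)
    scores = S @ W + positional_mask(n, mask)[:, :, None]
    # Per-feature weights over attended tokens j; in this toy sketch a row with
    # no allowed token (e.g., the last row under the forward mask) degenerates
    # to uniform weights.
    P = softmax(scores, axis=1)
    s = np.einsum('ijd,jd->id', P, h)            # masked token2token attention
    G = 1.0 / (1.0 + np.exp(-(s @ Wf1 + h @ Wf2)))   # fusion gate (sigmoid)
    return G * h + (1.0 - G) * s                 # gated residual combination

rng = np.random.default_rng(2)
d = 8
params = tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(6))
print(disa_block(rng.normal(size=(5, d)), params, mask='forward').shape)  # (5, 8)
```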
Experiments
achieves SoTA on various NLP tasks
Case Studies
the same word in different contexts gets drastically different embeddings

Personal Thoughts
would have been better if there were a comparison to existing models
Link : https://arxiv.org/pdf/1709.04696.pdf
Authors : Shen et al. 2017