Abstract
Directional Self-Attention Network (DiSAN)

Details
Introduction
RNN : LSTM, GRU, a natural choice for sequential tasks
CNN : hierarchical structure that captures local or position-invariant features
Recently, a fully attentional seq2seq model, the Transformer, was proposed
The contributions of this paper are:
propose multi-dimensional attention
propose directional attention
propose DiSAN, built from the above blocks, and show SoTA performance
Multi-Dimensional Attention
Traditional alignment scoring outputs a single scalar per token, which acts as that token's weight when producing a context vector
Multi-dimensional attention instead produces a score vector per token, which weights each feature of the token separately when producing the context vector (see the sketch below)
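A minimal NumPy sketch of the difference (my own illustration, not the paper's code): scalar attention normalizes one weight per token, while multi-dimensional attention normalizes a weight per token per feature.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))          # n token embeddings with d features each

# Scalar attention: one alignment score per token -> one weight per token.
scores = rng.normal(size=(n,))       # stand-in for any alignment function f(x_i)
p = softmax(scores, axis=0)          # (n,)
context = (p[:, None] * x).sum(axis=0)       # (d,) context vector

# Multi-dimensional attention: a score *vector* per token -> a weight per
# token and per feature, normalized across tokens for each feature separately.
scores_md = rng.normal(size=(n, d))  # stand-in for a vector-valued f(x_i)
P = softmax(scores_md, axis=0)       # (n, d); each column sums to 1
context_md = (P * x).sum(axis=0)     # (d,) context vector
```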
Two types of multi-dimensional self-attention (sketched after this list):
token2token : computes self-attention between every pair of tokens in a sentence
source2token : computes self-attention of each token with respect to the whole sentence; W learns a single weight shared across all tokens in the sentence
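A rough sketch of the two variants using an additive scoring function; the parameter names (W1, W2, W, b1) and random initialization are my own stand-ins for the paper's learned weights.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8
rng = np.random.default_rng(1)
x = rng.normal(size=(n, d))                      # token embeddings
W1, W2, W = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
b1 = np.zeros(d)

# token2token: a d-dimensional score vector for every ordered pair (i, j).
S = np.tanh((x @ W1)[:, None, :] + (x @ W2)[None, :, :] + b1)   # (n, n, d)
P = softmax(S @ W, axis=1)                       # weights over attended tokens j
s = np.einsum('ijd,jd->id', P, x)                # (n, d): one output per token

# source2token: the score depends on each token alone; W1 (and the other
# parameters) are shared across all tokens in the sentence.
Q = softmax(np.tanh(x @ W1 + b1) @ W, axis=0)    # (n, d) weights over tokens
sentence = (Q * x).sum(axis=0)                   # (d,): one vector per sentence
```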
Directional Self-Attention Network
uses the blocks proposed above to build DiSAN
Architecture of the DiSA block (a minimal sketch follows):
word embedding -> FCN -> token2token self-attention with a positional mask (forward, backward, or diag-disabled) -> fusion gate, similar to an LSTM gate, with a residual input
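A rough end-to-end sketch of one DiSA block under the shapes above; parameter names are my own, biases are omitted, and tanh stands in for the paper's ELU activation.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def positional_mask(n, kind):
    # Paper's conventions: forward mask is 0 where i < j, backward where i > j,
    # diag-disabled everywhere except i == j; blocked pairs get -inf (~ -1e9).
    i, j = np.indices((n, n))
    allowed = {'forward': i < j, 'backward': i > j, 'diag-disabled': i != j}[kind]
    return np.where(allowed, 0.0, -1e9)

def disa_block(x, params, mask='forward'):
    W_h, W1, W2, W, Wf1, Wf2 = params            # all (d, d); biases omitted
    n, d = x.shape
    h = np.tanh(x @ W_h)                         # FCN over the word embeddings
    S = np.tanh((h @ W1)[:, None, :] + (h @ W2)[None, :, :])    # (n, n, d)
    scores = S @ W + positional_mask(n, mask)[:, :, None]
    # Per-feature weights over attended tokens j; in this toy sketch a row with
    # no allowed token (e.g., the last row under the forward mask) degenerates
    # to uniform weights.
    P = softmax(scores, axis=1)
    s = np.einsum('ijd,jd->id', P, h)            # masked token2token attention
    G = 1.0 / (1.0 + np.exp(-(s @ Wf1 + h @ Wf2)))   # fusion gate (sigmoid)
    return G * h + (1.0 - G) * s                 # gated residual combination

rng = np.random.default_rng(2)
d = 8
params = tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(6))
print(disa_block(rng.normal(size=(5, d)), params, mask='forward').shape)  # (5, 8)
```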
Experiments
achieves SoTA on various NLP tasks
Case Studies
the same word in different contexts gets drastically different embeddings

Personal Thoughts
would have been better if there were a comparison to existing models
Link : https://arxiv.org/pdf/1709.04696.pdf
Authors : Shen et al. 2017