Abstract
Adds structure to the attention module via two variants: a linear-chain conditional random field and a graph-based parsing model
Experiments on tree transduction, neural machine translation, question answering, and natural language inference show better performance and improved behavior
Details
Attention is a function of keys, values, and a query: the keys encode the whole source context, the query encodes what is being asked, and the values hold the content to be retrieved.
In the authors' words, the attention mechanism is the expectation of an annotation function with respect to a latent variable whose distribution is parameterized as a function of the source and the query.
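To make the "expectation of an annotation function" view concrete, here is a minimal NumPy sketch of plain softmax attention, where the latent variable z is a single source position and the context vector is the expected value vector. The dot-product scoring and the toy dimensions are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def simple_attention(query, keys, values):
    """Standard soft attention: the context is the expectation of the value
    vectors under p(z = t | keys, query), a categorical over source positions."""
    scores = keys @ query          # (T,) unnormalized scores; dot-product scoring
                                   # is an illustrative choice, not from the paper
    p = softmax(scores)            # p(z = t), the usual attention weights
    context = p @ values           # E_{z ~ p}[ value_z ], the annotation expectation
    return context, p

# toy usage: T source positions with d-dimensional keys/values
T, d = 5, 8
rng = np.random.default_rng(0)
context, attn = simple_attention(rng.normal(size=d),
                                 rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)))
```

Structured attention keeps this expectation form but replaces the categorical distribution over z with a distribution over structured assignments (segments or trees).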
Segmentation Attention
Uses a linear-chain CRF with pairwise edges between adjacent positions, so the model can attend to contiguous segments rather than isolated positions
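A rough sketch of the inference that segmentation attention relies on: forward-backward over a linear-chain CRF with binary selection variables z_t, whose marginals p(z_t = 1) then weight the value vectors. The single shared 2x2 pairwise table and the random toy potentials are simplifying assumptions; in the model the potentials come from the source encodings and the query.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def segment_marginals(unary, pairwise):
    """Forward-backward for a linear-chain CRF over binary selection variables.

    unary:    (T, 2) log-potentials for z_t in {0, 1}
    pairwise: (2, 2) log-potentials for (z_{t-1}, z_t); one shared table is a
              simplifying assumption
    Returns p(z_t = 1), the probability that position t is attended to.
    """
    T = unary.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = unary[0]
    for t in range(1, T):
        alpha[t] = unary[t] + logsumexp(alpha[t - 1][:, None] + pairwise, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(pairwise + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1], axis=0)
    return np.exp(alpha[:, 1] + beta[:, 1] - log_Z)

# the structured "attention weights" are these marginals; the context vector is
# sum_t p(z_t = 1) * value_t, so several adjacent positions can be selected at once
T, d = 6, 4
rng = np.random.default_rng(0)
p_select = segment_marginals(rng.normal(size=(T, 2)), rng.normal(size=(2, 2)))
context = p_select @ rng.normal(size=(T, d))
```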
Syntactic Attention
Uses a graph-based dependency parsing model as the latent structure, so attention weights become marginals over (soft) parse trees
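For the tree-structured case, arc marginals of a non-projective dependency distribution can be computed in closed form with the matrix-tree theorem (Koo et al., 2007). A rough NumPy sketch of that computation, with exponentiated arc scores and an unconstrained artificial root as simplifying assumptions:

```python
import numpy as np

def dependency_marginals(arc_scores, root_scores):
    """Arc marginals of a non-projective dependency distribution via the
    matrix-tree theorem.

    arc_scores:  (n, n) log-potentials; arc_scores[i, j] scores head i -> modifier j
    root_scores: (n,)   log-potentials for the artificial root choosing word j
    Returns (p_arc, p_root): p_arc[i, j] = p(i is the head of j),
    p_root[j] = p(j hangs off the root).
    """
    n = arc_scores.shape[0]
    A = np.exp(arc_scores) * (1.0 - np.eye(n))   # exponentiated weights, no self-arcs
    r = np.exp(root_scores)
    # Laplacian with the root folded in: L[j, j] = r[j] + sum_i A[i, j], L[j, k] = -A[j, k]
    L = -A.copy()
    np.fill_diagonal(L, r + A.sum(axis=0))
    Linv = np.linalg.inv(L)
    # each marginal is (weight) * d(log det L)/d(weight)
    p_arc = A * (np.diag(Linv)[None, :] - Linv.T)
    p_root = r * np.diag(Linv)
    return p_arc, p_root

# sanity check: every word has exactly one expected head
n = 5
rng = np.random.default_rng(0)
p_arc, p_root = dependency_marginals(rng.normal(size=(n, n)), rng.normal(size=n))
assert np.allclose(p_arc.sum(axis=0) + p_root, 1.0)
```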
End-to-End training
Forward pass is straightforward: run the structured inference routine (e.g. forward-backward) to obtain marginals
Backpropagation through that inference routine is not fully optimized with off-the-shelf autodiff tools
Training is about 5x slower than with simple attention; inference speed is almost the same
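Why end-to-end training is possible at all: the marginals equal the gradient of the log-partition function with respect to the unary log-potentials, so a generic autodiff framework can backpropagate through the log-space forward algorithm (the part flagged above as not fully optimized in off-the-shelf tools). A minimal PyTorch sketch of the idea, under the same binary-CRF assumptions as the snippet above:

```python
import torch

def crf_log_partition(unary, pairwise):
    """Log-partition of a binary linear-chain CRF, computed with the forward
    algorithm in log space so that autograd can differentiate through it."""
    alpha = unary[0]
    for t in range(1, unary.shape[0]):
        alpha = unary[t] + torch.logsumexp(alpha.unsqueeze(1) + pairwise, dim=0)
    return torch.logsumexp(alpha, dim=0)

# marginals p(z_t = k) are exactly d(log Z)/d(unary log-potentials), so one
# backward pass through the dynamic program recovers the attention weights and
# lets gradients flow into whatever network produced the potentials
T = 6
unary = torch.randn(T, 2, requires_grad=True)
pairwise = torch.randn(2, 2, requires_grad=True)
crf_log_partition(unary, pairwise).backward()
marginals = unary.grad      # (T, 2); each row sums to 1
```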
Neural Machine Translation
En-Ja data from WAT, ~500k sentences, restricted to length <= 50
Character-level and word-level models, with a vocabulary cut-off of 10
Gains are not significant at the word level; slight improvement at the character level
Visualization of attention: adding structure yields richer, denser attention distributions
Personal Thoughts
Agree that enriching the attention mechanism is a good area of research
Not sure En-Ja from WAT was a good benchmark corpus, given that no significant improvement was observed
Too much information may be compressed into the attention mechanism;
even a single token already carries distributed context from its surroundings.
Link: https://arxiv.org/pdf/1702.00887.pdf (Structured Attention Networks), Authors: Kim et al., 2017