Abstract
Adds structure to the attention module via two variants: a linear-chain conditional random field and a graph-based parsing model
Experiments on tree transduction, neural machine translation, question answering, and natural language inference show better performance and improved behavior
Details
Attention is a function of keys, values, and a query: the keys encode the whole source context, the query encodes what is being asked, and the values hold the content to be retrieved.
In the authors' words, the attention mechanism is the expectation of an annotation function with respect to a latent variable whose distribution is parameterized as a function of the source and the query.
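To make the "expectation of an annotation function" view concrete, here is a minimal NumPy sketch of plain softmax attention, where the latent variable z is a single source position and the context vector is the expected value vector. The dot-product scoring and the toy dimensions are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def simple_attention(query, keys, values):
    """Standard soft attention: the context is the expectation of the value
    vectors under p(z = t | keys, query), a categorical over source positions."""
    scores = keys @ query          # (T,) unnormalized scores; dot-product scoring
                                   # is an illustrative choice, not from the paper
    p = softmax(scores)            # p(z = t), the usual attention weights
    context = p @ values           # E_{z ~ p}[ value_z ], the annotation expectation
    return context, p

# toy usage: T source positions with d-dimensional keys/values
T, d = 5, 8
rng = np.random.default_rng(0)
context, attn = simple_attention(rng.normal(size=d),
                                 rng.normal(size=(T, d)),
                                 rng.normal(size=(T, d)))
```

Structured attention keeps this expectation form but replaces the categorical distribution over z with a distribution over structured assignments (segments or trees).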
Segmentation Attention
Uses a linear-chain CRF with pairwise edges between adjacent positions, so the model can attend to contiguous segments rather than isolated positions
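A rough sketch of the inference that segmentation attention relies on: forward-backward over a linear-chain CRF with binary selection variables z_t, whose marginals p(z_t = 1) then weight the value vectors. The single shared 2x2 pairwise table and the random toy potentials are simplifying assumptions; in the model the potentials come from the source encodings and the query.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def segment_marginals(unary, pairwise):
    """Forward-backward for a linear-chain CRF over binary selection variables.

    unary:    (T, 2) log-potentials for z_t in {0, 1}
    pairwise: (2, 2) log-potentials for (z_{t-1}, z_t); one shared table is a
              simplifying assumption
    Returns p(z_t = 1), the probability that position t is attended to.
    """
    T = unary.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = unary[0]
    for t in range(1, T):
        alpha[t] = unary[t] + logsumexp(alpha[t - 1][:, None] + pairwise, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(pairwise + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1], axis=0)
    return np.exp(alpha[:, 1] + beta[:, 1] - log_Z)

# the structured "attention weights" are these marginals; the context vector is
# sum_t p(z_t = 1) * value_t, so several adjacent positions can be selected at once
T, d = 6, 4
rng = np.random.default_rng(0)
p_select = segment_marginals(rng.normal(size=(T, 2)), rng.normal(size=(2, 2)))
context = p_select @ rng.normal(size=(T, d))
```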
Syntactic Attention
Uses a graph-based dependency parsing model as the latent structure, so attention weights become marginals over (soft) parse trees
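For the tree-structured case, arc marginals of a non-projective dependency distribution can be computed in closed form with the matrix-tree theorem (Koo et al., 2007). A rough NumPy sketch of that computation, with exponentiated arc scores and an unconstrained artificial root as simplifying assumptions:

```python
import numpy as np

def dependency_marginals(arc_scores, root_scores):
    """Arc marginals of a non-projective dependency distribution via the
    matrix-tree theorem.

    arc_scores:  (n, n) log-potentials; arc_scores[i, j] scores head i -> modifier j
    root_scores: (n,)   log-potentials for the artificial root choosing word j
    Returns (p_arc, p_root): p_arc[i, j] = p(i is the head of j),
    p_root[j] = p(j hangs off the root).
    """
    n = arc_scores.shape[0]
    A = np.exp(arc_scores) * (1.0 - np.eye(n))   # exponentiated weights, no self-arcs
    r = np.exp(root_scores)
    # Laplacian with the root folded in: L[j, j] = r[j] + sum_i A[i, j], L[j, k] = -A[j, k]
    L = -A.copy()
    np.fill_diagonal(L, r + A.sum(axis=0))
    Linv = np.linalg.inv(L)
    # each marginal is (weight) * d(log det L)/d(weight)
    p_arc = A * (np.diag(Linv)[None, :] - Linv.T)
    p_root = r * np.diag(Linv)
    return p_arc, p_root

# sanity check: every word has exactly one expected head
n = 5
rng = np.random.default_rng(0)
p_arc, p_root = dependency_marginals(rng.normal(size=(n, n)), rng.normal(size=n))
assert np.allclose(p_arc.sum(axis=0) + p_root, 1.0)
```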
End-to-End training
Forward pass is straightforward: run the structured inference routine (e.g. forward-backward) to obtain marginals
Backpropagation through that inference routine is not fully optimized with off-the-shelf autodiff tools
Training is about 5x slower than with simple attention; inference speed is almost the same
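Why end-to-end training is possible at all: the marginals equal the gradient of the log-partition function with respect to the unary log-potentials, so a generic autodiff framework can backpropagate through the log-space forward algorithm (the part flagged above as not fully optimized in off-the-shelf tools). A minimal PyTorch sketch of the idea, under the same binary-CRF assumptions as the snippet above:

```python
import torch

def crf_log_partition(unary, pairwise):
    """Log-partition of a binary linear-chain CRF, computed with the forward
    algorithm in log space so that autograd can differentiate through it."""
    alpha = unary[0]
    for t in range(1, unary.shape[0]):
        alpha = unary[t] + torch.logsumexp(alpha.unsqueeze(1) + pairwise, dim=0)
    return torch.logsumexp(alpha, dim=0)

# marginals p(z_t = k) are exactly d(log Z)/d(unary log-potentials), so one
# backward pass through the dynamic program recovers the attention weights and
# lets gradients flow into whatever network produced the potentials
T = 6
unary = torch.randn(T, 2, requires_grad=True)
pairwise = torch.randn(2, 2, requires_grad=True)
crf_log_partition(unary, pairwise).backward()
marginals = unary.grad      # (T, 2); each row sums to 1
```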
Neural Machine Translation
En-Ja data from WAT, ~500k sentences, restricted to length <= 50
Character-level and word-level models, with a vocabulary cut-off of 10
Gains are not significant at the word level; slight improvement at the character level
Visualization of attention: adding structure yields richer, denser attention distributions
Personal Thoughts
Agree that enriching the attention mechanism is a good area of research
Not sure En-Ja from WAT was a good benchmark corpus, given that no significant improvement was observed
Too much information may be compressed into the attention mechanism;
even a single token already carries distributed context from its surroundings.
Link: https://arxiv.org/pdf/1702.00887.pdf (Structured Attention Networks), Authors: Kim et al., 2017