Abstract
proposes a novel decoding algorithm, INDIGO, which generates text in an arbitrary order via insertion operations
achieves competitive or even better performance in machine translation than conventional left-to-right generation
Datasets : WMT16 Ro-En, WMT18 En-Tr, KFTT En-Ja
Details
INDIGO
INsertion based Decoding with Inferred Generation Order
treats generation orders as latent variables
uses relative position representations to capture the generation order
uses a Transformer model extended with relative positions
maximizes the evidence lower bound (ELBO) of the original objective and studies four approximate posterior distributions over generation orders
Neural Autoregressive Decoding
a neural autoregressive model learns the probability of a target Y given a source X as a product of per-token probabilities, each conditioned on X and the previously generated tokens (the factorization is written out at the end of this section)
the common way to decode such a sequence model is left-to-right (L2R), since that is the natural reading order for most humans (a strong inductive bias)
however, L2R may not be the optimal order for generating every sequence
e.g., Japanese translation tends to produce better results with R2L decoding
e.g., code generation benefits from generating along the abstract syntax tree, etc.
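For reference, the standard left-to-right factorization mentioned above, written out (T is the target length and y_{<t} the previously generated prefix):

```latex
p_\theta(Y \mid X) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_{<t},\, X\right)
```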
Ordering as latent variable
introduces an ordering pi as a latent variable in the conditional probability
L2R decoding is recovered when z_t = t at every step
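A sketch of the marginal likelihood with the order as a latent variable; here (y_t, z_t) denote the token and the position generated at step t, and the exact indexing may differ slightly from the paper's notation:

```latex
p_\theta(Y \mid X)
\;=\; \sum_{\pi} p_\theta(Y_\pi \mid X)
\;=\; \sum_{\pi} \prod_{t=1}^{T} p_\theta\!\left(y_t,\, z_t \mid y_{<t},\, z_{<t},\, X\right)
```

Choosing z_t = t deterministically collapses the sum to a single term and recovers the ordinary L2R factorization above.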
Relative Representation of Positions
relative position representations are essential because we do not know how many tokens will eventually be generated, so absolute positions keep shifting as new tokens are inserted
a relative-position vector encodes the new token's position relative to all existing tokens at each timestep; accumulating these vectors across timesteps yields a relative-position matrix
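A minimal NumPy sketch of how such a relative-position matrix could be maintained during decoding; the entry convention (-1 = left, +1 = right) and the helper names are my assumptions, not the paper's code:

```python
import numpy as np

def insert_token(R, rel):
    """Append one row/column to the relative-position matrix R.

    R   : (t, t) integer matrix with entries in {-1, 0, +1};
          R[i, j] = -1 if token i is left of token j, +1 if it is right, 0 if i == j.
    rel : length-t vector with the new token's position relative to each
          existing token (+1 = right of it, -1 = left of it).

    Existing entries never change: inserting a new token cannot alter the
    relative order of tokens that are already placed.
    """
    t = R.shape[0]
    rel = np.asarray(rel, dtype=int)
    R_new = np.zeros((t + 1, t + 1), dtype=int)
    R_new[:t, :t] = R
    R_new[t, :t] = rel        # new token vs. existing tokens
    R_new[:t, t] = -rel       # existing tokens vs. new token (antisymmetry)
    return R_new

def to_left_to_right_order(R):
    """Recover each token's absolute index: count how many tokens lie to its left."""
    return (R == 1).sum(axis=1)
```

For example, starting from the boundary canvas ["<s>", "</s>"] with R = [[0, -1], [1, 0]], inserting a token with rel = [1, -1] (right of <s>, left of </s>) places it between them, and to_left_to_right_order returns [0, 2, 1].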
Insertion based Decoding
at each timestep, INDIGO predicts the next token together with its relative position (i.e., where to insert it), as shown in Alg. 1
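A rough sketch of the greedy decoding loop in Alg. 1 as I read it, reusing insert_token and to_left_to_right_order from the sketch above; model.predict and the <eod> stopping symbol are placeholders, not the paper's API:

```python
import numpy as np

def indigo_greedy_decode(model, src, max_steps=200):
    """Insertion-based decoding: at each step, jointly predict the next token
    and the relative position (slot) at which to insert it."""
    tokens = ["<s>", "</s>"]                  # initial canvas: boundary tokens only
    R = np.array([[0, -1],                    # <s> is left of </s>
                  [1,  0]])                   # </s> is right of <s>
    for _ in range(max_steps):
        # placeholder: the model scores (token, relative-position) pairs
        token, rel = model.predict(src, tokens, R)
        if token == "<eod>":                  # placeholder end-of-decoding symbol
            break
        R = insert_token(R, rel)              # grow the relative-position matrix
        tokens.append(token)                  # stored in generation order, not surface order
    # sort generated tokens into left-to-right surface order using R
    order = to_left_to_right_order(R)
    surface = [tok for _, tok in sorted(zip(order, tokens))]
    return [tok for tok in surface if tok not in ("<s>", "</s>")]
```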
Learning
maximizing the marginalized likelihood is intractable because, once tokens are order-free, all T! permutations of the tokens must be considered
instead, the paper maximizes the evidence lower bound (ELBO) of the original objective by introducing an approximate posterior distribution over generation orders, which can be flexibly chosen
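The ELBO referred to above, with q_\phi(\pi \mid X, Y) the approximate posterior over generation orders and \mathcal{H} its entropy (a standard Jensen-inequality sketch; symbols beyond those in the note are my own):

```latex
\log p_\theta(Y \mid X)
\;\geq\;
\mathbb{E}_{\pi \sim q_\phi(\pi \mid X, Y)}\!\left[\log p_\theta(Y_\pi \mid X)\right]
\;+\;
\mathcal{H}\!\left(q_\phi(\pi \mid X, Y)\right)
```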
Experiment - Machine Translation
Datasets
WMT16 Ro-En : 620k / 2k / 2k (train / dev / test)
WMT18 En-Tr : 207k / 3k / 3k
KFTT En-Ja : 405k / 1k / 1k
Result
except for the random order, all pre-defined orders perform relatively similarly, with L2R / R2L being the best
the adaptive order with beam size 8 outperforms L2R and R2L on all language pairs
Experiment - Word Order Recovery / Code Generation
the improvement from INDIGO is more pronounced on the word order recovery and code generation tasks
Personal Thoughts
the paper was a bit difficult to read
predicting tokens and their positions autoregressively is an interesting idea
I wish there were more ablations on what kinds of tokens the model predicts first, in terms of POS, frequency, etc.
interesting to see that the common-first order is worse than L2R/R2L; surprising how strong the L2R inductive bias is
Link : https://arxiv.org/pdf/1902.01370.pdf
Authors : Gu et al., 2019