Abstract
proposes a novel decoding algorithm, INDIGO, which generates text in an arbitrary order via insertion operations
achieves competitive or even better performance in machine translation than conventional left-to-right generation
Datasets : WMT16 Ro-En, WMT18 En-Tr, KFTT En-Ja
Details
INDIGO
INsertion based Decoding with Inferred Generation Order
treats generation orders as latent variables
uses relative position representations to capture the generation order
uses a Transformer model extended with relative positions
maximizes the evidence lower bound (ELBO) of the original objective and studies four approximate posterior distributions over generation orders
Neural Autoregressive Decoding
a neural autoregressive model learns the probability of a target Y given a source X as a product of per-token probabilities, each conditioned on X and the previously generated tokens (the factorization is written out at the end of this section)
the common way to decode such a sequence model is left-to-right (L2R), since that is the natural reading order for most humans (a strong inductive bias)
however, L2R may not be the optimal order for generating every sequence
e.g., Japanese translation tends to produce better results with R2L decoding
e.g., code generation benefits from generating along the abstract syntax tree, etc.
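For reference, the standard left-to-right factorization mentioned above, written out (T is the target length and y_{<t} the previously generated prefix):

```latex
p_\theta(Y \mid X) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_{<t},\, X\right)
```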
Ordering as latent variable
introduces an ordering pi as a latent variable in the conditional probability
L2R decoding is recovered when z_t = t at every step
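A sketch of the marginal likelihood with the order as a latent variable; here (y_t, z_t) denote the token and the position generated at step t, and the exact indexing may differ slightly from the paper's notation:

```latex
p_\theta(Y \mid X)
\;=\; \sum_{\pi} p_\theta(Y_\pi \mid X)
\;=\; \sum_{\pi} \prod_{t=1}^{T} p_\theta\!\left(y_t,\, z_t \mid y_{<t},\, z_{<t},\, X\right)
```

Choosing z_t = t deterministically collapses the sum to a single term and recovers the ordinary L2R factorization above.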
Relative Representation of Positions
relative position representations are essential because we do not know how many tokens will eventually be generated, so absolute positions keep shifting as new tokens are inserted
a relative-position vector encodes the new token's position relative to all existing tokens at each timestep; accumulating these vectors across timesteps yields a relative-position matrix
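A minimal NumPy sketch of how such a relative-position matrix could be maintained during decoding; the entry convention (-1 = left, +1 = right) and the helper names are my assumptions, not the paper's code:

```python
import numpy as np

def insert_token(R, rel):
    """Append one row/column to the relative-position matrix R.

    R   : (t, t) integer matrix with entries in {-1, 0, +1};
          R[i, j] = -1 if token i is left of token j, +1 if it is right, 0 if i == j.
    rel : length-t vector with the new token's position relative to each
          existing token (+1 = right of it, -1 = left of it).

    Existing entries never change: inserting a new token cannot alter the
    relative order of tokens that are already placed.
    """
    t = R.shape[0]
    rel = np.asarray(rel, dtype=int)
    R_new = np.zeros((t + 1, t + 1), dtype=int)
    R_new[:t, :t] = R
    R_new[t, :t] = rel        # new token vs. existing tokens
    R_new[:t, t] = -rel       # existing tokens vs. new token (antisymmetry)
    return R_new

def to_left_to_right_order(R):
    """Recover each token's absolute index: count how many tokens lie to its left."""
    return (R == 1).sum(axis=1)
```

For example, starting from the boundary canvas ["<s>", "</s>"] with R = [[0, -1], [1, 0]], inserting a token with rel = [1, -1] (right of <s>, left of </s>) places it between them, and to_left_to_right_order returns [0, 2, 1].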
Insertion based Decoding
at each timestep, INDIGO predicts the next token together with its relative position (i.e., where to insert it), as shown in Alg. 1
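A rough sketch of the greedy decoding loop in Alg. 1 as I read it, reusing insert_token and to_left_to_right_order from the sketch above; model.predict and the <eod> stopping symbol are placeholders, not the paper's API:

```python
import numpy as np

def indigo_greedy_decode(model, src, max_steps=200):
    """Insertion-based decoding: at each step, jointly predict the next token
    and the relative position (slot) at which to insert it."""
    tokens = ["<s>", "</s>"]                  # initial canvas: boundary tokens only
    R = np.array([[0, -1],                    # <s> is left of </s>
                  [1,  0]])                   # </s> is right of <s>
    for _ in range(max_steps):
        # placeholder: the model scores (token, relative-position) pairs
        token, rel = model.predict(src, tokens, R)
        if token == "<eod>":                  # placeholder end-of-decoding symbol
            break
        R = insert_token(R, rel)              # grow the relative-position matrix
        tokens.append(token)                  # stored in generation order, not surface order
    # sort generated tokens into left-to-right surface order using R
    order = to_left_to_right_order(R)
    surface = [tok for _, tok in sorted(zip(order, tokens))]
    return [tok for tok in surface if tok not in ("<s>", "</s>")]
```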
Learning
maximizing the marginalized likelihood is intractable because, once tokens are order-free, all T! permutations of the tokens must be considered
instead, the paper maximizes the evidence lower bound (ELBO) of the original objective by introducing an approximate posterior distribution over generation orders, which can be flexibly chosen
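The ELBO referred to above, with q_\phi(\pi \mid X, Y) the approximate posterior over generation orders and \mathcal{H} its entropy (a standard Jensen-inequality sketch; symbols beyond those in the note are my own):

```latex
\log p_\theta(Y \mid X)
\;\geq\;
\mathbb{E}_{\pi \sim q_\phi(\pi \mid X, Y)}\!\left[\log p_\theta(Y_\pi \mid X)\right]
\;+\;
\mathcal{H}\!\left(q_\phi(\pi \mid X, Y)\right)
```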
Experiment - Machine Translation
Datasets
WMT16 Ro-En : 620k / 2k / 2k (train / dev / test)
WMT18 En-Tr : 207k / 3k / 3k
KFTT En-Ja : 405k / 1k / 1k
Result
except for the random order, all pre-defined orders perform relatively similarly, with L2R / R2L being the best
the adaptive order with beam size 8 outperforms L2R and R2L on all language pairs
Experiment - Word Order Recovery / Code Generation
the improvement from INDIGO is more pronounced on the word order recovery and code generation tasks
Personal Thoughts
the paper was a bit difficult to read
predicting tokens and their positions autoregressively is an interesting idea
I wish there were more ablations on what kinds of tokens the model predicts first, in terms of POS, frequency, etc.
interesting to see that the common-first order is worse than L2R/R2L; surprising how strong the L2R inductive bias is
Link : https://arxiv.org/pdf/1902.01370.pdf
Authors : Gu et al., 2019