Abstract
Proposes a framework for training text generation models in non-monotonic orders.
Tokens are generated in a binary-tree structure.
Learning is framed as imitation learning.
Achieves performance competitive with conventional left-to-right generation.
Tasks: language modeling, sentence completion, word reordering, and machine translation.
Details
Non-Monotonic Generation as Binary-Tree
An example generation with the proposed approach.
Generation can start from any token.
The number in the green box is the generation order.
The number in the blue box is the reconstruction order.
Conventional left-to-right generation is a special case of the binary tree (a degenerate tree where every node has only a right child).
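The tree-to-sequence mapping above can be sketched as follows. This is an illustrative toy (the `Node` class and example sentence are not from the paper's code): the flat sentence is recovered from the generated binary tree by an in-order traversal.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    token: str
    left: Optional["Node"] = None   # subtree generated to the left of `token`
    right: Optional["Node"] = None  # subtree generated to the right of `token`

def to_sequence(root: Optional[Node]) -> List[str]:
    """In-order traversal recovers the flat sentence from the tree."""
    if root is None:
        return []
    return to_sequence(root.left) + [root.token] + to_sequence(root.right)

# The first generated token becomes the root; children fill in both sides.
tree = Node("sat",
            left=Node("cat", left=Node("the")),
            right=Node("mat", left=Node("on", right=Node("the"))))
print(to_sequence(tree))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

A left-to-right model corresponds to a tree where every node's left child is empty, so the in-order traversal degenerates to the generation order itself.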
Learning for Non-Monotonic Generation
Imitation learning framework: an oracle policy provides a valid distribution over token choices, and the model parameters learn to match it via a KL-divergence loss.
The oracle policy puts probability mass only on valid tokens (roughly, pi*(a|s) proportional to P_a for valid a; see the paper for the exact equation), where we have a choice for P_a:
uniform oracle: uniform distribution over valid tokens (does not lead to optimal quality)
coaching oracle: product of the uniform oracle and the current policy, renormalized
annealed coaching oracle: linear weighted sum of the uniform and coaching oracles, to provide variety in learning
In imitation learning, the roll-in policy is usually a stochastic mixture of the learned model and the oracle policy, but for this task, simply using the oracle policy throughout performs better.
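The three oracles can be sketched as follows. This is a rough illustration under my own reading (the toy vocabulary, learner distribution, and function names are made up; the paper's implementation differs): each oracle puts mass only on the currently valid tokens, and the variants differ in how P_a is chosen.

```python
import numpy as np

def uniform_oracle(valid_mask: np.ndarray) -> np.ndarray:
    """Uniform distribution over the currently valid tokens."""
    p = valid_mask.astype(float)
    return p / p.sum()

def coaching_oracle(valid_mask: np.ndarray, learner_probs: np.ndarray) -> np.ndarray:
    """Uniform oracle multiplied by the current policy, renormalized:
    prefers valid tokens the learner already likes."""
    p = valid_mask * learner_probs
    return p / p.sum()

def annealed_oracle(valid_mask, learner_probs, beta: float) -> np.ndarray:
    """Linear mixture: beta * uniform + (1 - beta) * coaching,
    with beta annealed from 1 toward 0 during training."""
    return (beta * uniform_oracle(valid_mask)
            + (1.0 - beta) * coaching_oracle(valid_mask, learner_probs))

def kl_to_oracle(oracle_probs, learner_probs, eps=1e-12) -> float:
    """Training loss: KL(oracle || learner), summed over the support."""
    mask = oracle_probs > 0
    return float(np.sum(oracle_probs[mask]
                        * (np.log(oracle_probs[mask])
                           - np.log(learner_probs[mask] + eps))))

# Toy example: vocabulary of 4 tokens, 2 of which are valid next choices.
valid = np.array([1, 0, 1, 0])
learner = np.array([0.5, 0.2, 0.1, 0.2])
print(annealed_oracle(valid, learner, beta=0.5))
```

Note how the coaching term lets the learner's own preferences among valid tokens shape the target, while the annealed mixture keeps some uniform exploration early in training.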
Experiments
Language Model
Dataset: Persona-Chat, with 133k / 16k / 15k utterances for train / validation / test.
Model: 2-layer unidirectional LSTM.
The non-monotonic (annealed) LM produced more diverse (unique and novel) sentences, with an average span of 1.3~1.4 (span = average number of child nodes).
POS-tag analysis leads to interesting insights:
the non-monotonic (annealed) model produces tokens in the order PUNCT > PNOUN > VERB > NOUN
left-to-right produces them in the order PNOUN > VERB > NOUN > PUNCT
Sentence Completion
Non-monotonic generation opens up a new spectrum in sentence completion, where generation can take place anywhere in the sentence.
A left-to-right model can only extend a given prefix to the right.
Machine Translation
End-tuning: since the end token is frequent in training, the model over-produces it during inference; its P_a value is tuned down on the validation set.
BLEU is 7~8 points lower than left-to-right, due to a drop in 4-gram precision (1- and 2-gram precision are higher, 3-gram is comparable).
The discrepancy on other metrics is smaller, but results are still below left-to-right.
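The end-tuning trick can be sketched as a simple rescale-and-renormalize step. This is my own minimal illustration (the function name, scale value, and toy distribution are hypothetical, not from the paper): the probability assigned to the end token is multiplied by a factor chosen on the validation set.

```python
import numpy as np

def tune_end(probs: np.ndarray, end_idx: int, scale: float) -> np.ndarray:
    """Downweight the end-token probability by `scale` (tuned on
    validation data) and renormalize so the distribution sums to 1."""
    p = probs.copy()
    p[end_idx] *= scale
    return p / p.sum()

# Toy distribution: index 0 is the over-produced end token.
probs = np.array([0.6, 0.25, 0.15])
print(tune_end(probs, end_idx=0, scale=0.5))  # mass shifts to content tokens
```

The effect is that trees grow deeper before terminating, countering the model's learned bias toward emitting the frequent end token too early.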
Personal Thoughts
Left-to-right order seems to be a good inductive bias for generation; that is likely why there is a big gap in the quantitative machine-translation results.
Generating tokens in non-monotonic order is far from human intuition, but a VERY interesting idea.
What is the potential gain of generating machine-translation outputs in non-monotonic order?
The idea is interesting, but it seems to make the problem harder to learn: the model now has to handle all combinatorial orderings of sentence generation.
Link: https://arxiv.org/pdf/1902.02192.pdf
Authors: Welleck et al. 2019