Abstract
Proposes a framework for training text generation models in non-monotonic orders.
Tokens are generated in a binary-tree structure.
Learning is framed as imitation learning.
Achieves performance competitive with conventional left-to-right generation.
Tasks: language modeling, sentence completion, word reordering, and machine translation.
Details
Non-Monotonic Generation as Binary-Tree
An example generation with the proposed approach.
Generation can start from any token.
The number in the green box is the generation order.
The number in the blue box is the reconstruction order.
Conventional left-to-right generation is a special case of the binary tree (a degenerate tree where every node has only a right child).
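The tree-to-sequence mapping above can be sketched as follows. This is an illustrative toy (the `Node` class and example sentence are not from the paper's code): the flat sentence is recovered from the generated binary tree by an in-order traversal.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    token: str
    left: Optional["Node"] = None   # subtree generated to the left of `token`
    right: Optional["Node"] = None  # subtree generated to the right of `token`

def to_sequence(root: Optional[Node]) -> List[str]:
    """In-order traversal recovers the flat sentence from the tree."""
    if root is None:
        return []
    return to_sequence(root.left) + [root.token] + to_sequence(root.right)

# The first generated token becomes the root; children fill in both sides.
tree = Node("sat",
            left=Node("cat", left=Node("the")),
            right=Node("mat", left=Node("on", right=Node("the"))))
print(to_sequence(tree))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

A left-to-right model corresponds to a tree where every node's left child is empty, so the in-order traversal degenerates to the generation order itself.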
Learning for Non-Monotonic Generation
Imitation learning framework: an oracle policy provides a valid distribution over token choices, and the model parameters learn to match it via a KL-divergence loss.
The oracle policy puts probability mass only on valid tokens (roughly, pi*(a|s) proportional to P_a for valid a; see the paper for the exact equation), where we have a choice for P_a:
uniform oracle: uniform distribution over valid tokens (does not lead to optimal quality)
coaching oracle: product of the uniform oracle and the current policy, renormalized
annealed coaching oracle: linear weighted sum of the uniform and coaching oracles, to provide variety in learning
In imitation learning, the roll-in policy is usually a stochastic mixture of the learned model and the oracle policy, but for this task, simply using the oracle policy throughout performs better.
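The three oracles can be sketched as follows. This is a rough illustration under my own reading (the toy vocabulary, learner distribution, and function names are made up; the paper's implementation differs): each oracle puts mass only on the currently valid tokens, and the variants differ in how P_a is chosen.

```python
import numpy as np

def uniform_oracle(valid_mask: np.ndarray) -> np.ndarray:
    """Uniform distribution over the currently valid tokens."""
    p = valid_mask.astype(float)
    return p / p.sum()

def coaching_oracle(valid_mask: np.ndarray, learner_probs: np.ndarray) -> np.ndarray:
    """Uniform oracle multiplied by the current policy, renormalized:
    prefers valid tokens the learner already likes."""
    p = valid_mask * learner_probs
    return p / p.sum()

def annealed_oracle(valid_mask, learner_probs, beta: float) -> np.ndarray:
    """Linear mixture: beta * uniform + (1 - beta) * coaching,
    with beta annealed from 1 toward 0 during training."""
    return (beta * uniform_oracle(valid_mask)
            + (1.0 - beta) * coaching_oracle(valid_mask, learner_probs))

def kl_to_oracle(oracle_probs, learner_probs, eps=1e-12) -> float:
    """Training loss: KL(oracle || learner), summed over the support."""
    mask = oracle_probs > 0
    return float(np.sum(oracle_probs[mask]
                        * (np.log(oracle_probs[mask])
                           - np.log(learner_probs[mask] + eps))))

# Toy example: vocabulary of 4 tokens, 2 of which are valid next choices.
valid = np.array([1, 0, 1, 0])
learner = np.array([0.5, 0.2, 0.1, 0.2])
print(annealed_oracle(valid, learner, beta=0.5))
```

Note how the coaching term lets the learner's own preferences among valid tokens shape the target, while the annealed mixture keeps some uniform exploration early in training.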
Experiments
Language Model
Dataset: Persona-Chat, with 133k / 16k / 15k utterances for train / validation / test.
Model: 2-layer unidirectional LSTM.
The non-monotonic (annealed) LM produced more diverse (unique and novel) sentences, with an average span of 1.3~1.4 (span = average number of child nodes).
POS-tag analysis leads to interesting insights:
the non-monotonic (annealed) model produces tokens in the order PUNCT > PNOUN > VERB > NOUN
left-to-right produces them in the order PNOUN > VERB > NOUN > PUNCT
Sentence Completion
Non-monotonic generation opens up a new spectrum in sentence completion, where generation can take place anywhere in the sentence.
A left-to-right model can only extend a given prefix to the right.
Machine Translation
End-tuning: since the end token is frequent in training, the model over-produces it during inference; its P_a value is tuned down on the validation set.
BLEU is 7~8 points lower than left-to-right, due to a drop in 4-gram precision (1- and 2-gram precision are higher, 3-gram is comparable).
The discrepancy on other metrics is smaller, but results are still below left-to-right.
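The end-tuning trick can be sketched as a simple rescale-and-renormalize step. This is my own minimal illustration (the function name, scale value, and toy distribution are hypothetical, not from the paper): the probability assigned to the end token is multiplied by a factor chosen on the validation set.

```python
import numpy as np

def tune_end(probs: np.ndarray, end_idx: int, scale: float) -> np.ndarray:
    """Downweight the end-token probability by `scale` (tuned on
    validation data) and renormalize so the distribution sums to 1."""
    p = probs.copy()
    p[end_idx] *= scale
    return p / p.sum()

# Toy distribution: index 0 is the over-produced end token.
probs = np.array([0.6, 0.25, 0.15])
print(tune_end(probs, end_idx=0, scale=0.5))  # mass shifts to content tokens
```

The effect is that trees grow deeper before terminating, countering the model's learned bias toward emitting the frequent end token too early.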
Personal Thoughts
Left-to-right order seems to be a good inductive bias for generation; that is likely why there is a big gap in the quantitative machine-translation results.
Generating tokens in non-monotonic order is far from human intuition, but a VERY interesting idea.
What is the potential gain of generating machine-translation outputs in non-monotonic order?
The idea is interesting, but it seems to make the problem harder to learn: the model now has to handle all combinatorial orderings of sentence generation.
Link: https://arxiv.org/pdf/1902.02192.pdf
Authors: Welleck et al. 2019