This paper proposes to use a target-domain language model as a discriminator in GAN training.
The motivation: the error signal that a binary-classifier discriminator provides to the generator is usually unstable and insufficient.
The empirical results show that it is possible to eliminate adversarial steps during training.
Introduces a thorough review of related work, such as non-parallel transfer in NLP, GANs, style transfer in CV, and LMs for re-ranking.
Unsupervised Text Style Transfer
Reviews the current approaches of Hu et al. and Shen et al.
Input: two unpaired text datasets X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} with corresponding styles v_x, v_y (which can be label embeddings).
Use an encoder E to encode a sentence x (resp. y) into a content vector z_x = E(x, v_x) (resp. z_y = E(y, v_y)).
Use a decoder G to generate the style-transferred sentence G(z, v) (the x/y subscripts are dropped here).
To guarantee that z_x and z_y follow the same distribution, assume z has a prior p(z) and add a KL-divergence regularizer on z_x, z_y; the model then becomes a VAE (a minimal sketch of this encoder/decoder setup follows this overview).
However, the posterior distribution of z fails to capture the content of a sentence.
To capture the desired style in the generated sentences, Hu et al. additionally run a style classifier on the generated samples, and the decoder G is trained to maximize the accuracy of this classifier.
Shen et al. instead use GAN training to align the distributions of z_x and z_y.
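For concreteness, a minimal PyTorch sketch of the encoder/decoder/KL setup above (module name, dimensions, and the GRU choice are illustrative assumptions, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class StyleTransferVAE(nn.Module):
    """Illustrative encoder/decoder: z = E(x, v), sentence = G(z, v)."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, n_styles=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.style_embed = nn.Embedding(n_styles, emb_dim)    # v_x / v_y as label embeddings
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.to_mu = nn.Linear(hid_dim, hid_dim)
        self.to_logvar = nn.Linear(hid_dim, hid_dim)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, tokens, style):
        # Prepend the style embedding so the content vector is z = E(x, v).
        emb = self.embed(tokens)                              # (batch, len, emb_dim)
        v = self.style_embed(style).unsqueeze(1)              # (batch, 1, emb_dim)
        _, h = self.encoder(torch.cat([v, emb], dim=1))
        return self.to_mu(h[-1]), self.to_logvar(h[-1])

    def kl(self, mu, logvar):
        # KL(q(z|x) || N(0, I)): the regularizer pushing z_x and z_y toward a shared prior.
        return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=-1).mean()

    def decode(self, z, style, tokens):
        # Teacher-forced decoding G(z, v); z initializes the decoder state.
        emb = self.embed(tokens) + self.style_embed(style).unsqueeze(1)
        out, _ = self.decoder(emb, z.unsqueeze(0))
        return self.out(out)                                  # (batch, len, vocab)
```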
Language Models as Discriminators
Model Architectures
Objective
Equations (1) & (2): train the LM with GAN-style training.
However, since the LM is a structured discriminator, we hope it assigns high perplexity only to negative (fake) sentences, so negative samples may not be necessary for training it. To investigate this, the authors add a weight γ to the loss on negative samples; if γ = 0, the LM is trained on real sentences only.
Experiments show that adding negative samples sometimes improves the results, but empirically it also makes training very unstable and the model diverges easily.
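A rough sketch of how the γ-weighted LM discriminator loss could be implemented (`lm` is assumed to be any callable returning next-token logits; the exact form should be checked against equations (1) & (2)):

```python
import torch.nn.functional as F

def lm_discriminator_loss(lm, real_tokens, fake_tokens, gamma=0.5):
    """Train the LM to assign low perplexity to real sentences and, weighted by
    gamma, high perplexity to generated (fake) ones."""
    # Standard LM training: negative log-likelihood of real sentences.
    real_logits = lm(real_tokens[:, :-1])                     # next-token logits
    real_nll = F.cross_entropy(real_logits.reshape(-1, real_logits.size(-1)),
                               real_tokens[:, 1:].reshape(-1))
    if gamma == 0.0:
        return real_nll                                       # gamma = 0: train on real data only
    # Penalize the LM for assigning high likelihood to fake sentences (push their NLL up).
    fake_logits = lm(fake_tokens[:, :-1])
    fake_nll = F.cross_entropy(fake_logits.reshape(-1, fake_logits.size(-1)),
                               fake_tokens[:, 1:].reshape(-1))
    return real_nll - gamma * fake_nll
```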
Training
Train the LMs according to equations (1) & (2).
Minimize the reconstruction loss.
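Continuing the earlier sketch, the reconstruction term could look like the following (again an illustration, not the paper's code; `model` is the sketched `StyleTransferVAE`):

```python
import torch.nn.functional as F

def reconstruction_loss(model, tokens, style):
    """Auto-encode a sentence in its own style: encode to z, decode with teacher
    forcing, and add the cross-entropy against the original tokens to the KL term."""
    mu, logvar = model.encode(tokens, style)
    logits = model.decode(mu, style, tokens[:, :-1])          # predict tokens 1..T from 0..T-1
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    return ce + model.kl(mu, logvar)
```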
Continuous approximation (Figure 2)
Use the Gumbel-softmax to approximate the discrete output sentence from G, and then compute the cross-entropy loss under the LM.
Feed the weighted average of word embeddings (rather than discrete tokens) into the LM (see the paper for details).
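A small sketch of this continuous approximation (the temperature, the `inputs_embeds` interface, and helper names are assumptions rather than the paper's implementation):

```python
import torch.nn.functional as F

def soft_lm_loss(gen_logits, lm, lm_embedding, tau=0.5):
    """Relax G's discrete output with Gumbel-softmax, feed the weighted average of
    word embeddings to the LM, and score the soft sequence with cross-entropy."""
    # (batch, len, vocab) soft one-hot samples, differentiable w.r.t. gen_logits.
    soft_onehot = F.gumbel_softmax(gen_logits, tau=tau, hard=False)
    # Weighted average of LM word embeddings instead of a discrete token lookup.
    soft_emb = soft_onehot @ lm_embedding.weight              # (batch, len, emb_dim)
    lm_logits = lm(inputs_embeds=soft_emb[:, :-1])            # hypothetical: an LM that accepts embeddings
    # Cross-entropy between LM predictions and the (soft) next-token distribution.
    log_probs = F.log_softmax(lm_logits, dim=-1)
    return -(soft_onehot[:, 1:] * log_probs).sum(-1).mean()
```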
Question: Why not simply use policy gradient?
Overcoming mode collapse
Preliminary experiments show that the LM prefers short sentences.
Two tricks are applied:
Normalize the loss by the sentence length.
Fix the length of the generated sentence to be the same as that of the input sentence.
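An illustrative version of the two tricks (tensor shapes and the decoding API are assumptions):

```python
import torch

def length_normalized_loss(token_losses, lengths):
    """Trick 1: divide each sentence's summed LM loss by its length, so the generator
    cannot satisfy the LM simply by emitting very short sentences."""
    # token_losses: (batch, max_len) per-token losses; lengths: (batch,) true lengths.
    positions = torch.arange(token_losses.size(1), device=token_losses.device)
    mask = positions[None, :] < lengths[:, None]
    per_sentence = (token_losses * mask).sum(dim=1) / lengths.clamp(min=1)
    return per_sentence.mean()

# Trick 2: decode exactly as many steps as the input sentence has tokens, instead of
# stopping when the generator emits an end-of-sentence symbol, e.g.
#   generated = generator.decode(z, style, max_steps=input_length)   # hypothetical API
```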
Metadata