howardyclo opened 6 years ago
This paper presents a systematic re-evaluation of a number of recently proposed GAN-based text generation models. The authors argue that N-gram-based metrics (e.g., BLEU, Self-BLEU) are not sensitive to semantic deterioration of generated texts, and they propose alternative metrics that better capture the quality and diversity of the generated samples. Their main finding is that none of the considered models performs convincingly better than a conventional language model (LM). Furthermore, when performing a hyperparameter search, they consistently find that adversarial learning hurts performance, further indicating that the language model remains a hard-to-beat baseline.
Revealing D's state to G: the authors found that it is important to fuse D's and G's states with a non-linear function, so they use a one-layer MLP to predict the distribution over the next token. In this setup they use a per-step discriminator. Three variants of the LeakGAN model are considered, differing in how a hidden state of D is made available to G: LeakGAN-leak, LeakGAN-noleak, and LeakGAN-mixed.
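The fusion step can be sketched as follows. The shapes, parameter names, and the tanh non-linearity are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_states(g_state, d_state, W, b, V):
    """Fuse generator and discriminator hidden states with a one-layer
    MLP, then project to a distribution over the vocabulary.
    W, b, V are hypothetical parameters; shapes are illustrative."""
    h = np.tanh(W @ np.concatenate([g_state, d_state]) + b)  # non-linear fusion
    return softmax(V @ h)  # distribution over the next token

rng = np.random.default_rng(0)
hid, vocab = 8, 16
g_state = rng.standard_normal(hid)      # G's hidden state
d_state = rng.standard_normal(hid)      # D's (leaked) hidden state
W = rng.standard_normal((hid, 2 * hid))
b = rng.standard_normal(hid)
V = rng.standard_normal((vocab, hid))
p = fuse_states(g_state, d_state, W, b, V)
print(p.shape)  # a valid probability vector over the vocabulary
```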
They optimize each model's hyperparameters with 100 trials of random search. Once the best hyperparameters are found, they retrain with them 7 times and report the mean and standard deviation of each metric, to quantify how sensitive the model is to random initialization.
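The search-then-retrain protocol can be sketched as below; `toy_train` and the search space are hypothetical stand-ins for a real training run:

```python
import random
import statistics

def random_search(train_and_eval, space, trials=100, retrain=7, seed=0):
    """Random hyperparameter search, then retrain the best config several
    times to measure sensitivity to initialization. `train_and_eval` is a
    hypothetical callable returning a scalar score (higher is better)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for t in range(trials):
        params = {k: rng.choice(v) for k, v in space.items()}
        score = train_and_eval(params, seed=t)
        if score > best_score:
            best_score, best_params = score, params
    # retrain the winning config with fresh seeds
    scores = [train_and_eval(best_params, seed=1000 + i) for i in range(retrain)]
    return best_params, statistics.mean(scores), statistics.stdev(scores)

def toy_train(params, seed):
    # toy stand-in: best score near lr = 1e-3, plus seed-dependent noise
    rng = random.Random(seed)
    return -abs(params["lr"] - 1e-3) + rng.gauss(0, 1e-4)

space = {"lr": [1e-4, 1e-3, 1e-2], "dropout": [0.0, 0.3, 0.5]}
best, mean, std = random_search(toy_train, space, trials=20, retrain=7)
print(best, mean, std)
```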
The dataset consists of 600k unique sentences from the SNLI and MultiNLI datasets, preprocessed with the unsupervised text tokenization model SentencePiece, with a vocabulary size of 4k.
The experiments suggest that both Fréchet Distance (FD) and the reverse LM score can be successfully used as metrics for unsupervised sequence generation models.
For all GAN models, they fix the generator to be a one-layer Long Short-Term Memory (LSTM) network.
Dear howardyclo, thanks for your great comments. I would just suggest another BLEU-based metric to better capture the quality and diversity of the generated samples. The overall BLEU score consists of two parts: a forward BLEU that measures the precision (quality) of the generator, and a backward BLEU that measures the recall (diversity) of the generator. The details are shown in the following paper: "Toward Diverse Text Generation with Inverse Reinforcement Learning" https://arxiv.org/pdf/1804.11258.pdf
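The forward/backward BLEU idea can be sketched with a simplified corpus-level BLEU (modified n-gram precision with a geometric mean, no brevity penalty or smoothing, so not the official implementation):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=2):
    """Simplified corpus-level BLEU: geometric mean of clipped
    n-gram precisions of `hypotheses` against `references`."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        match, total = 0, 0
        # clip each n-gram count by its maximum count over all references
        ref_counts = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                ref_counts[g] = max(ref_counts[g], c)
        for hyp in hypotheses:
            hyp_counts = Counter(ngrams(hyp, n))
            match += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total += sum(hyp_counts.values())
        log_p += math.log(match / total) if match else float("-inf")
    return math.exp(log_p / max_n)

gen = [["the", "cat", "sat"], ["a", "dog", "ran"]]
real = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"]]
forward = bleu(gen, real)   # quality: generated text scored against real text
backward = bleu(real, gen)  # diversity: real text scored against generated text
print(forward, backward)
```

Forward BLEU is high when generated samples resemble the reference corpus; backward BLEU is high only when the generated set is diverse enough to "cover" the references.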
@xpqiu Hi Xipeng, actually your paper is currently on my reading list! Ha! I will get to it soon and write a summary about it. It would be great if you could help me review the summary, thanks!
Metadata
Authors: Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly
Organization: Google AI
Conference: NIPS 2018
Paper: https://arxiv.org/pdf/1806.04936.pdf