howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

On Accurate Evaluation of GANs for Language Generation #21

Open howardyclo opened 5 years ago

howardyclo commented 5 years ago

Metadata

- Authors: Stanislau Semeniuta, Aliaksei Severyn, Sylvain Gelly
- Organization: Google AI
- Conference: NIPS 2018
- Paper: https://arxiv.org/pdf/1806.04936.pdf

howardyclo commented 5 years ago

Summary

This paper presents a systematic re-evaluation of a number of recently proposed GAN-based text generation models. The authors argue that n-gram based metrics (e.g., BLEU, Self-BLEU) are not sensitive to semantic deterioration of generated texts, and they propose alternative metrics that better capture the quality and diversity of the generated samples. Their main finding is that none of the considered models performs convincingly better than a conventional language model (LM). Furthermore, when performing a hyperparameter search, they consistently find that adversarial learning hurts performance, further indicating that the language model remains a hard-to-beat baseline.
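For reference, a minimal sketch of how Self-BLEU is typically computed with NLTK: each generated sentence is scored against the remaining generations as references, so a high Self-BLEU signals low diversity. The toy corpus here is a made-up placeholder.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated samples (tokenized).
generated = [
    "the cat sat on the mat".split(),
    "a dog ran in the park".split(),
    "the cat sat on the sofa".split(),
]

smooth = SmoothingFunction().method1
scores = []
for i, hyp in enumerate(generated):
    refs = generated[:i] + generated[i + 1:]  # all other generations as references
    scores.append(sentence_bleu(refs, hyp,
                                weights=(0.25, 0.25, 0.25, 0.25),
                                smoothing_function=smooth))

self_bleu = sum(scores) / len(scores)
print(f"Self-BLEU-4: {self_bleu:.3f}")  # higher => less diverse
```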

Motivations for GAN-based Text Generation Models

Motivations for This Paper

Types of Benchmarked Models

Tricks to Address RL Issues for Training Discrete GAN Models

Discrete GAN Models

SeqGAN

Equation 1

Equation 2
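The equations were embedded as images in the original issue. Presumably they correspond to the standard GAN minimax objective and the REINFORCE-style policy gradient that SeqGAN (Yu et al., 2017) uses to work around non-differentiable discrete sampling; reconstructed from that standard formulation, not from the image:

```latex
% Standard GAN minimax objective:
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]
+ \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

% REINFORCE-style gradient used by SeqGAN, with the discriminator's
% score acting as the reward R for a sampled sequence Y_{1:T}:
\nabla_\theta J(\theta)
= \mathbb{E}_{Y_{1:T} \sim G_\theta}
  \Big[ R(Y_{1:T}) \sum_{t=1}^{T}
        \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \Big]
```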

LeakGAN

LeakGAN reveals D's internal state to G. The authors found it important to fuse D's and G's states with a non-linear function, so they use a one-layer MLP to predict the distribution over the next token. In this setup, they use a per-step discriminator. Three variants of the LeakGAN model are considered, differing in how D's hidden state is made available to G: LeakGAN-leak, LeakGAN-noleak and LeakGAN-mixed:

Figure 1
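A rough PyTorch sketch of the fusion described above, where D's leaked hidden state is concatenated with G's LSTM state and a one-layer MLP produces the next-token distribution. Layer sizes, names, and the choice of tanh are my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LeakedFusion(nn.Module):
    """Fuses G's LSTM state with D's leaked state via a one-layer MLP
    to predict the next-token distribution (illustrative sketch)."""
    def __init__(self, g_hidden, d_hidden, vocab_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(g_hidden + d_hidden, g_hidden),
            nn.Tanh(),                       # non-linear fusion
            nn.Linear(g_hidden, vocab_size),
        )

    def forward(self, g_state, d_state):
        fused = torch.cat([g_state, d_state], dim=-1)
        return torch.log_softmax(self.mlp(fused), dim=-1)  # next-token log-probs

# Usage: at each step, feed G's LSTM output and D's leaked features.
fusion = LeakedFusion(g_hidden=256, d_hidden=128, vocab_size=4000)
log_probs = fusion(torch.randn(32, 256), torch.randn(32, 128))  # (batch, vocab)
```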

Evaluation Methodology

Metrics

Parameter optimization procedure

They optimize each model's hyperparameters with 100 trials of random search. Once the best hyperparameters are found, they retrain with them 7 times and report the mean and standard deviation of each metric, quantifying how sensitive the model is to random initialization.
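A sketch of this protocol; the search space and the training stub are placeholders, not the paper's actual ranges.

```python
import random
import statistics

def sample_config():
    """Draw one hyperparameter configuration (illustrative search space)."""
    return {"lr": 10 ** random.uniform(-4, -2),
            "dropout": random.uniform(0.0, 0.5),
            "batch_size": random.choice([32, 64, 128])}

def train_and_eval(config, seed):
    """Stub: pretend-train and return a validation metric where lower is
    better (e.g., FD). Replace with real training + evaluation."""
    random.seed(f"{seed}-{sorted(config.items())}")
    return random.random()

# 100 trials of random search.
trials = [sample_config() for _ in range(100)]
best = min(trials, key=lambda cfg: train_and_eval(cfg, seed=0))

# Retrain 7 times with the best config to quantify seed sensitivity.
runs = [train_and_eval(best, seed=s) for s in range(1, 8)]
print(f"mean={statistics.mean(runs):.3f} std={statistics.stdev(runs):.3f}")
```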

Data

600k unique sentences from the SNLI and MultiNLI datasets, preprocessed with the unsupervised text tokenization model SentencePiece, with a vocabulary size of 4k.
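A sketch of that preprocessing with the `sentencepiece` Python package; file names are placeholders.

```python
import sentencepiece as spm

# Train an unsupervised subword model on the raw corpus (hypothetical path),
# with a 4k vocabulary as in the paper.
spm.SentencePieceTrainer.train(
    input="snli_multinli_sentences.txt",
    model_prefix="nli_sp",
    vocab_size=4000,
)

# Tokenize sentences with the trained model.
sp = spm.SentencePieceProcessor(model_file="nli_sp.model")
pieces = sp.encode("A man is playing a guitar.", out_type=str)
print(pieces)  # list of subword pieces
```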

Metric Comparison

Figure 2

Figure 3

The experiments suggest that both the Fréchet Distance (FD, computed on sentence embeddings) and the reverse LM score can successfully serve as metrics for unsupervised sequence generation models.
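For reference, a sketch of the Fréchet Distance between Gaussian fits of real and generated sentence embeddings, using the same closed form as FID. The random "embeddings" stand in for whatever sentence encoder is used (the paper uses InferSent embeddings).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb, gen_emb):
    """Fréchet Distance between Gaussians fitted to two embedding sets,
    each of shape (n_samples, dim). Lower is better."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

# Toy usage with random vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```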

GAN Model Comparison (Important Findings)

For all GAN models, they fix the generator to be a one-layer Long Short-Term Memory (LSTM) network.
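For concreteness, a minimal PyTorch sketch of such a one-layer LSTM generator sampling token by token; sizes and the sampling loop are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    """One-layer LSTM language model used as the generator (sketch)."""
    def __init__(self, vocab_size=4000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def sample(self, bos_id, max_len=30):
        tok = torch.full((1, 1), bos_id, dtype=torch.long)
        state, seq = None, []
        for _ in range(max_len):
            h, state = self.lstm(self.embed(tok), state)
            probs = torch.softmax(self.out(h[:, -1]), dim=-1)
            tok = torch.multinomial(probs, 1)  # sample the next token
            seq.append(tok.item())
        return seq

print(LSTMGenerator().sample(bos_id=1))
```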

Table 2

Future Research

Related Work

xpqiu commented 5 years ago

Dear howardyclo, thanks for your great comments. I would just suggest another BLEU-based metric to better capture the quality and diversity of the generated samples. The overall BLEU score consists of two parts: forward BLEU measures the precision (quality) of the generator, while backward BLEU measures the recall (diversity) of the generator. The details are given in the following paper: "Toward Diverse Text Generation with Inverse Reinforcement Learning" https://arxiv.org/pdf/1804.11258.pdf
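A toy sketch of that forward/backward BLEU pairing with NLTK; the corpora are placeholders and the exact protocol is in the paper above.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def corpus_avg_bleu(hypotheses, references):
    """Average sentence-level BLEU of each hypothesis against a shared
    reference set (illustrative, not the paper's exact implementation)."""
    return sum(sentence_bleu(references, h, smoothing_function=smooth)
               for h in hypotheses) / len(hypotheses)

real = ["the cat sat on the mat".split(), "a dog ran in the park".split()]
generated = ["the cat sat on a mat".split(), "the cat sat on the mat".split()]

forward_bleu = corpus_avg_bleu(generated, real)   # precision / quality
backward_bleu = corpus_avg_bleu(real, generated)  # recall / diversity
print(f"forward={forward_bleu:.3f} backward={backward_bleu:.3f}")
```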

howardyclo commented 5 years ago

@xpqiu Hi Xipeng, actually, your paper is currently on my paper-reading pending list! Ha! I will soon get to it and write a summary. It would be great if you could help me review the summary, thanks!