howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

Adversarial Contrastive Estimation #23

Open howardyclo opened 6 years ago

howardyclo commented 6 years ago

Metadata

howardyclo commented 6 years ago

Summary

This paper proposes to augment the negative sampling process in contrastive learning with an adversarially learned conditional distribution, resulting in a negative sampler that adapts to the data distribution and training dynamics and finds harder negatives (referred to as Adversarial Contrastive Estimation (ACE)). They demonstrate the efficacy and generality of ACE on learning word embeddings, order embeddings and knowledge graph embeddings.


Motivation of This Paper (Problems of Previous Approach)


Background: Contrastive Learning

Contrastive learning is a sampling-based learning method that contrasts losses on observed positive examples with losses on sampled fictitious negative examples, minimizing the loss on positive examples while maximizing it on negative ones. One popular algorithm is Noise Contrastive Estimation (NCE), which has been applied to learning word embeddings. See the comment below for notes on Noise Contrastive Estimation.

Why contrastive learning?

Objective in general form:

In its most general form, the objective of contrastive learning is:

Eq. 1: min_ω E_{p(x+, y+, y−)} [ l_ω(x+, y+, y−) ]

By the law of total expectation and the fact that, given x+, the negative sample y− does not depend on y+ (i.e., p(y+, y− | x+) = p(y+ | x+) p(y− | x+)), Eq. 1 can be rewritten as:

Eq. 2: E_{p(x+)} [ E_{p(y+|x+) p(y−|x+)} l_ω(x+, y+, y−) ]

Non-separable loss and separable loss

There are two forms of l_ω(x+, y+, y−), depending on the problem (a small sketch follows below):

- Separable loss: l_ω decomposes into a sum of a term that depends only on (x+, y+) and a term that depends only on (x+, y−), e.g. via separate scores s_ω(x+, y+) and s̃_ω(x+, y−).
- Non-separable loss: the positive and negative examples interact inside the loss, so it cannot be decomposed into such a sum (e.g. a max-margin triplet loss).

In NCE, p(y−|x+) is taken to be some unconditional p_{nce}(y−) in Eq. 2 or 3, which leads to efficient computation but sacrifices sampling efficiency (y− is unlikely to be a hard negative example).
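To make the separable vs. non-separable distinction above concrete, here is a minimal Python sketch with toy score values (my own illustration of the two loss forms, not code from the paper):

```python
def separable_loss(s_pos, s_neg):
    """Separable: a term depending only on (x+, y+) plus a term
    depending only on (x+, y-), here simply s~(x+, y-) - s(x+, y+)."""
    return s_neg - s_pos

def nonseparable_triplet_loss(s_pos, s_neg, margin=1.0):
    """Non-separable: the max couples the positive and negative scores,
    so the loss cannot be split into independent per-pair terms."""
    return max(0.0, margin + s_neg - s_pos)

# toy scores for a single (x+, y+, y-) triple
s_pos, s_neg = 2.0, 0.5
print(separable_loss(s_pos, s_neg))             # -1.5
print(nonseparable_triplet_loss(s_pos, s_neg))  # 0.0 (margin satisfied)
```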


Proposed Method: Adversarial Contrastive Estimation (ACE)

To remedy the above problem of a fixed unconditional negative sampler p_{nce}(y-), ACE augments it into an adversarial mixture negative sampler: λ p_{nce}(y-) + (1 − λ) gθ(y-|x+).
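A minimal sketch of sampling a negative from this mixture, assuming a fixed p_{nce} over a small vocabulary and a hypothetical conditional generator g_theta that returns a categorical distribution (illustrative only, not the paper's implementation):

```python
import numpy as np

def sample_negative(x_pos, p_nce, g_theta, lam, rng):
    """Draw y- from the mixture  lam * p_nce(y-) + (1 - lam) * g_theta(y-|x+)."""
    if rng.random() < lam:
        probs = p_nce            # fixed, unconditional NCE component (easy negatives)
    else:
        probs = g_theta(x_pos)   # adversarial, conditional component (harder negatives)
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
V = 5                                                        # toy vocabulary size
p_nce = np.full(V, 1.0 / V)                                  # e.g. uniform/unigram noise
g_theta = lambda x: np.array([0.05, 0.05, 0.7, 0.1, 0.1])    # hypothetical g_theta(.|x)
y_neg = sample_negative(x_pos=0, p_nce=p_nce, g_theta=g_theta, lam=0.5, rng=rng)
```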

Training objectives for embedding model (discriminator parameterized by ω) and gθ (generator parameterized by θ)

Eq. 2 can be written as (ignoring E_{p(x+)} for notational brevity):

Eq. 4: L(ω, θ; x) = λ E_{p(y+|x) p_{nce}(y−)} l_ω(x, y+, y−) + (1 − λ) E_{p(y+|x) gθ(y−|x)} l_ω(x, y+, y−)

Then (ω, θ) are learned with GAN-style training:

Eq. 5: min_ω max_θ E_{p(x)} L(ω, θ; x)

The generator is trained via the policy gradient ∇θ L(θ; x):

Eq. 6: ∇θ L(θ; x) = E_{p(y+|x) gθ(y−|x)} [ l_ω(x, y+, y−) ∇θ log gθ(y−|x) ]

where the expectation is w.r.t. p(y+|x) gθ(y−|x), and the discriminator loss l_ω(x, y+, y−) acts as the reward.

Note: The generator aims to maximize L(.), thus the reward is l_ω(x, y+, y-)

Question: Many papers do not add a negative sign to the reward term. Since the REINFORCE update rule is gradient ascent, I think there should be no negative sign on the reward term.

REINFORCE
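A rough numpy sketch of the policy-gradient (REINFORCE) update in Eq. 6 for a categorical gθ parameterized by one logit per candidate y−, with the discriminator loss l_ω used as the reward (my own illustration, not the authors' code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_gradient(logits, x_pos, y_pos, l_omega, rng, n_samples=16):
    """Monte Carlo estimate of E_{g_theta(y-|x)}[ l_omega(x, y+, y-) * grad log g_theta(y-|x) ]."""
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for _ in range(n_samples):
        y_neg = rng.choice(len(probs), p=probs)
        reward = l_omega(x_pos, y_pos, y_neg)   # discriminator loss used as reward
        grad_log = -probs.copy()                # grad of log softmax(logits)[y_neg] w.r.t. logits
        grad_log[y_neg] += 1.0
        grad += reward * grad_log
    return grad / n_samples                     # ascend this gradient to maximize L(theta; x)

# toy usage with a hypothetical discriminator loss
rng = np.random.default_rng(0)
l_omega = lambda x, yp, yn: float(yn == 2)      # pretend candidate 2 is the hardest negative
grad = reinforce_gradient(np.zeros(5), x_pos=0, y_pos=1, l_omega=l_omega, rng=rng)
```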

We can also rewrite L(ω, θ; x) when l_ω is a separable loss:

Eq. 7: L(ω, θ; x) = E_{p(y+|x)} s_ω(x, y+) − E_{p_{nce}(y−)} s̃_ω(x, y−) − E_{gθ(y−|x)} s̃_ω(x, y−)

The following is the derivation of L(ω, θ; x) with a separable loss l_ω: Derivation

Question: The authors drop the λ terms in Eq. 7; I'm not sure whether this is correct.

In the separable-loss case, the reward term in the policy gradient ∇θ L(θ; x) for updating the generator is −s̃(x+, y−), since only the last term depends on the generator parameters θ. (The original paper writes ω here, which is wrong!)

Note: The generator aims to maximize L(·); since only the last term −E_{gθ(y−|x)} s̃(x+, y−) depends on θ, this is equivalent to minimizing E_{gθ(y−|x)} s̃(x+, y−). Thus, the reward is −s̃(x+, y−). In the non-separable case, the reward term is instead l_ω(x, y+, y−).


Techniques for Further Improvement

Regularization for gθ: Entropy and training stability

To avoid mode collapse in GAN training, they propose adding a regularization term for gθ that encourages it to have high entropy: Eq8 where H(gθ(y−|x)) is the entropy of gθ(y−|x), and c = log(k) is the entropy of a uniform distribution over k choices. Intuitively, R_{ent} expresses the prior that the generator should spread its mass over more than k choices for each x.

H(U_k) = − Σ_{i=1}^{k} P(x_i) log P(x_i) (entropy definition)
= − k · (1/k) · log(1/k) (uniform distribution)
= − log(1/k)
= log k
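A quick numeric check of this identity:

```python
import numpy as np

k = 8
p_uniform = np.full(k, 1.0 / k)
H = -np.sum(p_uniform * np.log(p_uniform))
assert np.isclose(H, np.log(k))   # entropy of a uniform distribution over k choices is log k
```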

Note that it is actually "max" instead of "min" in the R_{ent} equation (I confirmed this with the authors).

Handling false negatives

ACE samples false negatives more often than NCE. Thus, two further strategies are introduced (although entropy regularization already reduces this problem):

Variance reduction

They subtract a baseline from the reward to reduce the variance of the policy gradient estimator, using the self-critical baseline method, where the baseline is the reward obtained at the generator's greedy (argmax) output.
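A small sketch of the variance-reduced reward, assuming the self-critical baseline is the reward of the generator's greedy sample (illustrative; it follows the self-critical sequence training idea, not necessarily the paper's exact implementation):

```python
import numpy as np

def self_critical_advantage(probs, reward_fn, rng):
    """Advantage = reward(sampled y-) - reward(greedy y-); use it in place of the raw reward."""
    y_sampled = rng.choice(len(probs), p=probs)     # stochastic sample from g_theta(.|x)
    y_greedy = int(np.argmax(probs))                # baseline: greedy choice of g_theta(.|x)
    advantage = reward_fn(y_sampled) - reward_fn(y_greedy)
    return y_sampled, advantage

# toy usage: reward_fn stands in for the reward, e.g. -s~(x+, y-) in the separable case
rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.6, 0.1])
reward_fn = lambda y: -float(y)                     # hypothetical reward per candidate
y_neg, adv = self_critical_advantage(probs, reward_fn, rng)
```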

Improving exploration in gθ(y−|x) by leveraging NCE samples

Reweighting the reward term in Eq. 6 by gθ(y−|x) / p_{nce}(y−).

My understanding is that if p_{nce} has already sampled y−, this makes gθ(y−|x) sample y− less. The authors state that this is essentially "off-policy" exploration in reinforcement learning, since NCE samples are not drawn according to gθ(y−|x).
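A minimal sketch of the importance reweighting, assuming y− was drawn from p_{nce} while the policy-gradient expectation is taken under gθ(·|x):

```python
def off_policy_reward(y_neg, reward, g_theta_probs, p_nce_probs):
    """Importance-weight the reward of an NCE-drawn negative so it can be used
    in the policy gradient for g_theta (off-policy correction by g_theta / p_nce)."""
    importance_weight = g_theta_probs[y_neg] / p_nce_probs[y_neg]
    return importance_weight * reward

# toy usage: candidate 2 drawn from p_nce, its reward reweighted for g_theta's update
g_theta_probs = [0.05, 0.05, 0.7, 0.1, 0.1]
p_nce_probs = [0.2, 0.2, 0.2, 0.2, 0.2]
weighted = off_policy_reward(y_neg=2, reward=1.0,
                             g_theta_probs=g_theta_probs, p_nce_probs=p_nce_probs)
```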

Experiment Settings

Word embeddings

Hypernym prediction

Knowledge graph embeddings


Experiment Results

Fig. 1-4 (ACE: mixture of p_{nce} and gθ; ADV: gθ only)

Table1.

Tables 2 & 3

Fig5 (Knowledge graph embedding training progress. Ent: Entropy Regularization; SC: Self-critical baseline; IW: off-policy learning)

Table4 (Knowledge graph embedding performance)


Limitations


Reference

howardyclo commented 6 years ago

Notes on Noise Contrastive Estimation (NCE)

Idea

In neural language modeling, computing the probability normalization term in the softmax is expensive. Thus, NCE and its simplified variant, negative sampling, transform the computationally expensive learning problem into a binary classification proxy problem that uses the same parameters θ to distinguish samples drawn from the empirical distribution (i.e., (w+, c) ~ p*(w|c)) from samples generated by the noise distribution q(w) (i.e., (w−, c) ~ q(w)). In practice, q(w) is typically a uniform or unigram distribution. Note: we denote w+ as a positive example, w− as a negative example, and c as a given condition (context).

Neural language modeling: Train a neural network pθ(w | c) = uθ(w, c) / Z(c) to approximate the empirical (training data) distribution p*(w|c) as closely as possible, where uθ(w, c) = exp sθ(w, c) and sθ(w, c) is a differentiable function that assigns a score to word w in context c. Z(c) = Σ_{w′} uθ(w′, c) is the probability normalization term, which requires summing over the entire vocabulary.
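A minimal sketch showing why Z(c) is the bottleneck: normalizing requires scoring every word in the vocabulary for each context (the toy scorer s_theta below is hypothetical):

```python
import numpy as np

def full_softmax_prob(s_theta, w, context, vocab):
    """p_theta(w|c) = u_theta(w, c) / Z(c), with u_theta = exp(s_theta) and
    Z(c) computed by summing over the entire vocabulary (O(|V|) per context)."""
    u_all = np.exp([s_theta(wi, context) for wi in vocab])  # u_theta(w', c) for every w'
    Z = u_all.sum()                                         # the expensive normalization term
    return np.exp(s_theta(w, context)) / Z

vocab = ["a", "b", "c", "d"]                                # toy vocabulary
s_theta = lambda w, c: 0.1 * (len(w) + len(c))              # hypothetical score function
p = full_softmax_prob(s_theta, w="a", context="the", vocab=vocab)
```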

Training

During training, we form the training data by sampling, for each context c, one positive example (w+, c) with label D = 1 and k negative examples (w−, c) drawn from q(w) with label D = 0.

Formulate the above description into a mathematical expression:

p(D = 1, w | c) = 1/(k + 1) · p*(w | c)
p(D = 0, w | c) = k/(k + 1) · q(w)

And using the definition of conditional probability:

p(D = 1 | w, c) = p*(w | c) / (p*(w | c) + k · q(w))
p(D = 0 | w, c) = k · q(w) / (p*(w | c) + k · q(w))

NCE replaces the empirical distribution p*(w|c) with the model distribution pθ(w|c) and makes 2 assumptions to deal with the computation problem in Z(c):

1. Z(c) is treated as a learnable parameter (one per context c) instead of being computed by summing over the vocabulary.
2. In practice, simply fixing Z(c) = 1 works well, so the model learns to be approximately self-normalized.

Making these assumptions, p*(w|c) (which is replaced with pθ(w|c)) becomes uθ(w, c) in the above equations:

p(D = 1 | w, c) = uθ(w, c) / (uθ(w, c) + k · q(w))
p(D = 0 | w, c) = k · q(w) / (uθ(w, c) + k · q(w))

Note: In negative sampling, the simplified variant of NCE, the term k × q(w) is replaced by 1, so that p(D = 1 | w, c) = σ(sθ(w, c)). It is equivalent to NCE only when k = |V| and q is uniform.
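A quick numeric check that replacing k × q(w) with 1 turns the posterior into a logistic sigmoid of the score:

```python
import numpy as np

s = 0.7                              # s_theta(w, c) for some word-context pair
u = np.exp(s)                        # u_theta(w, c) = exp(s_theta(w, c))
p_d1 = u / (u + 1.0)                 # negative sampling: k * q(w) replaced by 1
assert np.isclose(p_d1, 1.0 / (1.0 + np.exp(-s)))   # equals sigmoid(s_theta(w, c))
```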

The objective function, maximizing the above conditional log-likelihood of D, is:

L_θ = Σ_{(w,c)} [ log p(D = 1 | w, c) + k · E_{w̄ ~ q} log p(D = 0 | w̄, c) ]

Since the expectation in the second term is still an expensive summation over the vocabulary, we use Monte Carlo sampling to approximate it:

L_θ ≈ Σ_{(w,c)} [ log p(D = 1 | w, c) + Σ_{i=1}^{k} log p(D = 0 | w̄_i, c) ],  with w̄_i ~ q
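A numpy sketch of this Monte Carlo objective for a single (w, c) pair with k noise samples, under the assumptions above (uθ(w, c) = exp sθ(w, c), Z(c) = 1; the scorer s_theta below is a stand-in):

```python
import numpy as np

def nce_objective(s_theta, w_pos, context, vocab, q_probs, k, rng):
    """log p(D=1 | w+, c) + sum_{i=1..k} log p(D=0 | w_i, c)  with noise words w_i ~ q."""
    u = lambda w: np.exp(s_theta(w, context))          # u_theta(w, c), with Z(c) assumed to be 1
    q = lambda w: q_probs[vocab.index(w)]

    obj = np.log(u(w_pos) / (u(w_pos) + k * q(w_pos)))            # positive term
    for w_neg in rng.choice(vocab, size=k, p=q_probs):            # Monte Carlo noise samples
        obj += np.log(k * q(w_neg) / (u(w_neg) + k * q(w_neg)))   # negative terms
    return obj                                                    # maximize w.r.t. theta

# toy usage with a uniform noise distribution and a hypothetical scorer
vocab = ["a", "b", "c", "d"]
q_probs = np.full(len(vocab), 1.0 / len(vocab))
s_theta = lambda w, c: 0.1 * (len(w) + len(c))
rng = np.random.default_rng(0)
value = nce_objective(s_theta, w_pos="a", context="the", vocab=vocab, q_probs=q_probs, k=3, rng=rng)
```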

Asymptotic Analysis

NCE6 (the analysis shows that, as k → ∞, the gradient of the NCE objective approaches the gradient of the maximum-likelihood objective of the softmax model)

Read More