This paper proposes to augment the negative sampling process in contrastive learning with an adversarially learned conditional distribution, resulting in a negative sampler that adapts to the data distribution and training dynamics and finds harder negatives (referred to as Adversarial Contrastive Estimation (ACE)). The authors demonstrate the efficacy and generality of ACE on learning word embeddings, order embeddings, and knowledge graph embeddings.
Contrastive learning is a sampling-based learning method that contrasts losses on observed positive examples with those on sampled fictitious negative examples, minimizing the loss on positive ones while maximizing it on negative ones. One popular algorithm is Noise Contrastive Estimation (NCE), which has been applied to learning word embeddings. See the comment below for Notes on Noise Contrastive Estimation.
In its most general form, the objective of contrastive learning is:
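The equation image is missing here, so this is my reconstruction of Eq. 1 as I recall it from the paper (l_ω is the per-example loss of the embedding model ω):

```latex
% Eq. 1 (reconstruction): expected loss over positive pairs and sampled negatives
L(\omega) = \mathbb{E}_{p(x^+, y^+, y^-)}\, l_\omega(x^+, y^+, y^-)
```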
By the law of total expectation and the fact that, given x+, the negative sample y- does not depend on y+ (i.e., p(y+, y- | x+) = p(y+ | x+) p(y- | x+)), Eq. 1 can be rewritten as:
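My reconstruction of Eq. 2:

```latex
% Eq. 2 (reconstruction): condition on x+, then factor the conditionals
L(\omega) = \mathbb{E}_{p(x^+)}\Big[\, \mathbb{E}_{p(y^+|x^+)\, p(y^-|x^+)}\, l_\omega(x^+, y^+, y^-) \,\Big]
```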
There are two cases of l_ω(x+, y+, y-), depending on the problem:
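The two cases were given as equations in the original; as I understand the distinction, a rough sketch is:

```latex
% Case 1 -- separable loss: the positive and negative contributions decompose
% additively (this is the form used in the separable-loss notes below)
l_\omega(x^+, y^+, y^-) = s_\omega(x^+, y^+) - \tilde{s}_\omega(x^+, y^-)

% Case 2 -- non-separable loss: the positive and negative scores interact
% inside the loss, e.g. a margin (triplet) loss
l_\omega(x^+, y^+, y^-) = \max\big(0,\; \eta + \tilde{s}_\omega(x^+, y^-) - s_\omega(x^+, y^+)\big)
```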
In NCE, p(y-|x+) in Eq. 2 or 3 is taken to be some unconditional p_{nce}(y-), which leads to efficient computation but sacrifices sampling efficiency (y- might not be a hard negative example).
To remedy the above problem of a fixed unconditional negative sampler p_{nce}(y-), ACE augments it into an adversarial mixture negative sampler: λ p_{nce}(y-) + (1 − λ) gθ(y-|x+).
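A minimal sketch of that mixture sampling step (names like `g_theta` and `p_nce` are mine, and I assume the generator returns a normalized distribution over candidate negatives; this is not the authors' code):

```python
import numpy as np

def sample_negative(x, p_nce, g_theta, lam, rng=None):
    """Draw one negative y- from the ACE mixture lam * p_nce(y-) + (1 - lam) * g_theta(y-|x).

    p_nce   : 1-D array, fixed unconditional noise distribution over candidates
    g_theta : callable, x -> 1-D array, adversarial conditional negative sampler
    lam     : mixture weight lambda in [0, 1]
    """
    rng = rng or np.random.default_rng()
    # With probability lam fall back to the cheap NCE noise distribution,
    # otherwise ask the generator for a (hopefully harder) conditional negative.
    probs = p_nce if rng.random() < lam else g_theta(x)
    return rng.choice(len(probs), p=probs)
```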
Eq. 2 can then be written as (ignoring E_p(x+) for notational brevity):
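Expanding the mixture by linearity of expectation, this should look like (my reconstruction):

```latex
% Mixture expansion (reconstruction of the missing equation)
L(\omega, \theta; x) =
  \lambda\, \mathbb{E}_{p(y^+|x)\, p_{nce}(y^-)}\, l_\omega(x, y^+, y^-)
  + (1-\lambda)\, \mathbb{E}_{p(y^+|x)\, g_\theta(y^-|x)}\, l_\omega(x, y^+, y^-)
```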
Then, (ω, θ) are learned with GAN-style training: the generator gθ is trained via the policy gradient ∇θL(θ, x), where the expectation is taken w.r.t. p(y+|x) gθ(y-|x) and the discriminator loss l_ω(x, y+, y-) acts as the reward.
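My reconstruction of that gradient in REINFORCE (score-function) form; only the gθ term of the mixture depends on θ:

```latex
% REINFORCE estimator for the generator update
\nabla_\theta L(\theta; x) =
  (1-\lambda)\, \mathbb{E}_{p(y^+|x)\, g_\theta(y^-|x)}
  \big[\, l_\omega(x, y^+, y^-)\, \nabla_\theta \log g_\theta(y^-|x) \,\big]
```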
Note: The generator aims to maximize L(·); thus, the reward is l_ω(x, y+, y-).
Question: Many papers do not add a negative sign to the reward term. Since the REINFORCE update rule is gradient ascent, I think there should be no negative sign on the reward term.
We can also rewrite L(ω, θ; x) when l_ω is a separable loss:
The following is the derivation of L(ω, θ; x) with a separable loss l_ω:
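My reconstruction of the derivation, keeping the λ terms (cf. the question below about Eq. 7):

```latex
% Separable loss l = s(x, y+) - s~(x, y-): the expectation splits term by term,
% and p(y-|x) is the mixture lambda * p_nce(y-) + (1 - lambda) * g_theta(y-|x)
\begin{aligned}
L(\omega, \theta; x)
 &= \mathbb{E}_{p(y^+|x)\, p(y^-|x)} \big[\, s_\omega(x, y^+) - \tilde{s}_\omega(x, y^-) \,\big] \\
 &= \mathbb{E}_{p(y^+|x)}\, s_\omega(x, y^+) \;-\; \mathbb{E}_{p(y^-|x)}\, \tilde{s}_\omega(x, y^-) \\
 &= \mathbb{E}_{p(y^+|x)}\, s_\omega(x, y^+)
    \;-\; \lambda\, \mathbb{E}_{p_{nce}(y^-)}\, \tilde{s}_\omega(x, y^-)
    \;-\; (1-\lambda)\, \mathbb{E}_{g_\theta(y^-|x)}\, \tilde{s}_\omega(x, y^-)
\end{aligned}
```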
Question: The authors drop the λ terms in Eq. 7; I'm not sure whether that is correct.
In the separable-loss case, the reward term in the policy gradient ∇θL(θ, x) for updating the generator would be −s̃(x+, y−), since only the last term depends on the generator parameters θ (the original paper writes ω here, which is wrong!).
Note: The generator aims to maximize L(·), i.e., to maximize the last term −E_{gθ(y-|x)} s̃(x+, y−), which is the same as minimizing E_{gθ(y-|x)} s̃(x+, y−). Thus, the reward is −s̃(x+, y−). In the non-separable case, the reward term is instead l_ω(x, y+, y-).
To avoid gθ mode collapse during GAN training, they propose adding a regularization term that encourages gθ to have high entropy, where H(gθ(y-|x)) is the entropy of gθ and c = log(k) is the entropy of a uniform distribution over k choices. Intuitively, R_{ent} expresses the prior that the generator should spread its mass over more than k choices for each x.
H(x) = − Σ_{i=1}^{k} P(x_i) log P(x_i)    (entropy definition)
     = − k · (1/k) · log(1/k)             (uniform distribution)
     = − log(1/k)
     = log k
Note that it is actually "max" instead of "min" in the R_ent equation (I just confirmed this with the authors).
ACE samples false negatives more often than NCE, so two further strategies are introduced (although the entropy regularization already reduces this problem):
They subtract a baseline from the reward to reduce the variance of the policy gradient estimator, using the self-critical baseline method (the baseline is sketched after this list).
Note: the idea is to reward sampled negatives only when they do better than greedy decoding.
Reweighting the reward term in Eq. 6 by gθ(y−|x) / p_{nce}(y−) (also sketched below).
My understanding is that, if p_{nce} has already sampled y-, then gθ(y−|x) should sample y- less. The authors state that this is essentially "off-policy" exploration in reinforcement learning, since the NCE samples are not drawn according to gθ.
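A sketch of both corrections in the separable case (my reconstruction; I take the baseline to be the reward of the generator's greedy choice, as in self-critical sequence training):

```latex
% Self-critical baseline: reward of the greedy (argmax) negative
b(x) = -\tilde{s}_\omega(x^+, y^{*}), \qquad y^{*} = \arg\max_y g_\theta(y|x)

% Variance-reduced generator update, with reward r(y^-) = -\tilde{s}_\omega(x^+, y^-)
\nabla_\theta L \approx \big( r(y^-) - b(x) \big)\, \nabla_\theta \log g_\theta(y^-|x)

% Off-policy correction for negatives drawn from p_nce: importance weight on the reward
w(y^-) = \frac{g_\theta(y^-|x)}{p_{nce}(y^-)}
```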
(ACE: mixture of p_{nce} and gθ; ADV: only gθ)
(Knowledge graph embedding training progress. Ent: Entropy Regularization; SC: Self-critical baseline; IW: off-policy learning)
(Knowledge graph embedding performance)
In neural language modeling, computing the probability normalization term in the softmax is expensive. Thus, NCE and its simplified variant, negative sampling, transform the computationally expensive learning problem into a binary classification proxy problem that uses the same parameters θ to distinguish samples from the empirical distribution (i.e., (w+, c) ~ p*(w|c)) from samples generated by the noise distribution q(w) (i.e., (w-, c) ~ q(w)). In practice, q(w) is a uniform or unigram distribution. Note: we denote by w+ a positive example, by w- a negative example, and by c a given condition (context).
Neural language modeling: train a neural network pθ(w | c) = uθ(w, c) / Z(c) to approximate the empirical (training data) distribution p*(w|c) as closely as possible, where uθ(w, c) = exp(sθ(w, c)), sθ(w, c) is a differentiable function that assigns a score to a word in context, and Z(c) is the probability normalization term.
During training, we form the training data by sampling, for each context c, one positive example (w+, c) and k negative examples (w-, c), with labels D = 1 and D = 0 respectively.
Formulate the above description into a mathematical expression:
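My reconstruction of the missing expression (one true sample drawn from p*(w|c) with probability 1/(1+k) and a noise sample with probability k/(1+k)):

```latex
% Joint distribution over the label D and the word, given context c
p(D = 1, w \mid c) = \frac{1}{1 + k}\, p^{*}(w \mid c), \qquad
p(D = 0, w \mid c) = \frac{k}{1 + k}\, q(w)
```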
And using the definition of conditional probability:
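Which gives (reconstruction):

```latex
% Posterior probability that a sample came from the data vs. from the noise
p(D = 1 \mid w, c) = \frac{p^{*}(w \mid c)}{p^{*}(w \mid c) + k\, q(w)}, \qquad
p(D = 0 \mid w, c) = \frac{k\, q(w)}{p^{*}(w \mid c) + k\, q(w)}
```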
NCE replaces the empirical distribution p*(w|c) with the model distribution pθ(w|c) and makes two assumptions to deal with the computation problem in Z(c): (1) Z(c) can be treated as a learnable constant for each context c; (2) in practice, it is simply fixed to Z(c) = 1 for all contexts, which works because the model learns to approximately self-normalize.
Making these assumptions, p*(w|c) (which has been replaced with pθ(w|c)) can be rewritten as uθ(w, c) in the above equations:
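That is (reconstruction, with Z(c) = 1 so pθ(w|c) = uθ(w, c)):

```latex
p(D = 1 \mid w, c) = \frac{u_\theta(w, c)}{u_\theta(w, c) + k\, q(w)}, \qquad
p(D = 0 \mid w, c) = \frac{k\, q(w)}{u_\theta(w, c) + k\, q(w)}
```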
Note: In negative sampling, the simplified variant of NCE, the term k × q(w) becomes 1. It is equivalent to NCE when k = |V| and q is uniform.
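A minimal sketch of that simplification: with k·q(w) → 1 the posterior reduces to a sigmoid of the score, giving the familiar word2vec loss (function and variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(score_pos, scores_neg):
    """Negative-sampling objective for one (w+, c) pair.

    With k*q(w) replaced by 1, p(D=1|w,c) = sigmoid(s_theta(w,c)), so the binary
    log-likelihood of the labels becomes the standard negative-sampling loss.
    score_pos  : scalar s_theta(w+, c)
    scores_neg : array of s_theta(w-_i, c) for the k sampled negatives
    """
    scores_neg = np.asarray(scores_neg)
    return -(np.log(sigmoid(score_pos)) + np.sum(np.log(sigmoid(-scores_neg))))
```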
Going back to NCE, the objective function for maximizing the above conditional log-likelihood of D is:
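My reconstruction of the objective, summing over the positive pairs (w, c) in the training data:

```latex
% NCE objective: binary log-likelihood of the labels D
L^{NCE_k}_\theta = \sum_{(w, c)}
  \Big( \log p(D = 1 \mid w, c)
      + k\, \mathbb{E}_{\bar{w} \sim q} \log p(D = 0 \mid \bar{w}, c) \Big)
```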
Since the expectation in the second term is still an expensive summation over the vocabulary, we approximate it with Monte Carlo sampling:
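The Monte Carlo approximation (reconstruction), drawing k negatives w̄_i ~ q(w) per positive pair:

```latex
% Replace k * E_{\bar{w} ~ q}[...] with a sum over k sampled negatives
L^{MC}_\theta = \sum_{(w, c)}
  \Big( \log \frac{u_\theta(w, c)}{u_\theta(w, c) + k\, q(w)}
      + \sum_{i=1}^{k} \log \frac{k\, q(\bar{w}_i)}{u_\theta(\bar{w}_i, c) + k\, q(\bar{w}_i)} \Big),
  \qquad \bar{w}_i \sim q(w)
```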