Closed lancerts closed 2 months ago
old_logprobs
refers to $q{\theta}$, which will not be taken derivatives. logprobs
is $q{\theta}$ that will be taken derivatives. Computing gradients for ratio
refers to compute $\frac{\nabla q{\theta}}{q{\theta}}$. Therefore, ratio * w
refers to $\rhot(\theta) = \frac{q{\theta}}{\widetilde{p}}$, where only the $\theta$ in the numerator will be taken gradients.
Thanks for the detailed explanation.
In paper , in code
Is
old_logprobs
referring to the teacher-mixed sampling $\tilde{p}$ or its something else?