Closed lancerts closed 2 months ago
```python
def _pg_loss(
    self,
    logprobs: TensorType["batch_size", "response_size"],
    old_logprobs: TensorType["batch_size", "response_size"],
    advantages: TensorType["batch_size", "response_size"],
    mask: TensorType["batch_size", "response_size"],
    w: TensorType["batch_size", "response_size"],
):
    """PPO objective function.

    References:
    - https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html
    """
    n = mask.sum()
    log_ratio = (logprobs - old_logprobs) * mask
    ratio = torch.exp(log_ratio.float())
    ratio = ratio * w
```
In https://github.com/microsoft/LMOps/issues/255, it states `w=1`; is it also 1 here? In the paper, this part has no mention of a `w` factor.
No. $w$ is absorbed into $\rho_t(\theta)$ in the formula.
cool, thanks
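For readers landing here: the quoted snippet stops right after the weighted ratio. Below is a minimal sketch of how a ratio with `w` folded in (matching the maintainer's note that $w$ is absorbed into $\rho_t(\theta)$) would typically feed a standard PPO clipped surrogate. The `cliprange` value, the loss reduction, and the standalone function signature are assumptions for illustration, not taken from the repo.

```python
import torch

def pg_loss(logprobs, old_logprobs, advantages, mask, w, cliprange=0.2):
    # Number of valid (unmasked) tokens, used to average the loss.
    n = mask.sum()
    # Token-level log importance ratio, zeroed on padded positions.
    log_ratio = (logprobs - old_logprobs) * mask
    # w is folded directly into the importance ratio rho_t(theta).
    ratio = torch.exp(log_ratio.float()) * w
    # Standard PPO clipped surrogate: take the pessimistic (max) of the
    # unclipped and clipped negative objectives, then average over tokens.
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    return torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / n
```

With `w = 1` and `logprobs == old_logprobs`, the ratio is exactly 1 everywhere and the loss reduces to the negative mean advantage over unmasked tokens.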