microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[MiniLLM] mismatch between formula and implementation (gradL)_long? #264

Closed · lancerts closed this issue 3 weeks ago

lancerts commented 3 weeks ago
    # (Requires `import torch` and `from torchtyping import TensorType`.)
    def _pg_loss(
        self,
        logprobs: TensorType["batch_size", "response_size"],
        old_logprobs: TensorType["batch_size", "response_size"],
        advantages: TensorType["batch_size", "response_size"],
        mask: TensorType["batch_size", "response_size"],
        w: TensorType["batch_size", "response_size"],
    ):
        """PPO objective function.
        References:
        - https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html
        """
        # Number of unmasked response tokens, used to normalize the loss.
        n = mask.sum()

        # Per-token log importance ratio, zeroed outside the response.
        log_ratio = (logprobs - old_logprobs) * mask
        ratio = torch.exp(log_ratio.float())
        # The factor in question:
        ratio = ratio * w
        # ... (snippet truncated)
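The quote cuts off here. For orientation, the standard PPO clipped-surrogate step that typically follows `ratio` would look something like this (a generic sketch of PPO, with `cliprange` as a hypothetical hyperparameter; not necessarily the repo's exact code):

        # Generic PPO clipped surrogate: take the pessimistic (elementwise max
        # of the negated objectives) of the unclipped and clipped terms, then
        # average over the unmasked response tokens.
        pg_loss1 = -advantages * ratio
        pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
        pg_loss = torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / n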

In https://github.com/microsoft/LMOps/issues/255, it is stated that $w = 1$. Is it also 1 here?

In the paper, this part makes no mention of a $w$ factor:

[Screenshot: the corresponding $(\nabla L)_{\text{long}}$ formula from the paper]
t1101675 commented 3 weeks ago

No. $w$ is absorbed into $\rho_t(\theta)$ in the formula.
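In other words, the implementation's `ratio * w` is itself the $\rho_t(\theta)$ that appears in the paper: for $w > 0$, $w \cdot e^{\log q_\theta - \log q_{\theta_\text{old}}} = e^{\log q_\theta - \log q_{\theta_\text{old}} + \log w}$. A minimal standalone check of that equivalence (made-up tensors, not the repo's trainer):

    import torch

    # Multiplying the ratio by w is the same as absorbing w into rho_t(theta):
    # w * exp(logprobs - old_logprobs) == exp(logprobs - old_logprobs + log(w)).
    logprobs = torch.tensor([[-0.7, -1.4]])      # log q_theta(y_t | y_<t, x)
    old_logprobs = torch.tensor([[-0.9, -0.7]])  # log q_theta_old(y_t | y_<t, x)
    w = torch.tensor([[0.9, 1.0]])               # per-token weight (w > 0)

    ratio_times_w = torch.exp(logprobs - old_logprobs) * w       # as implemented
    rho_absorbed = torch.exp(logprobs - old_logprobs + w.log())  # w inside rho
    assert torch.allclose(ratio_times_w, rho_absorbed)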

lancerts commented 3 weeks ago

cool, thanks