microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[MiniLLM] mismatch between formula and implementation (gradL)_long? #264

Closed · lancerts closed this issue 3 weeks ago

lancerts commented 3 weeks ago
    # (Requires `import torch` and `from torchtyping import TensorType`.)
    def _pg_loss(
        self,
        logprobs: TensorType["batch_size", "response_size"],
        old_logprobs: TensorType["batch_size", "response_size"],
        advantages: TensorType["batch_size", "response_size"],
        mask: TensorType["batch_size", "response_size"],
        w: TensorType["batch_size", "response_size"],
    ):
        """PPO objective function.
        References:
        - https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html
        """
        # Number of unmasked response tokens, used to normalize the loss.
        n = mask.sum()

        # Per-token log importance ratio, zeroed outside the response.
        log_ratio = (logprobs - old_logprobs) * mask
        ratio = torch.exp(log_ratio.float())
        # The factor in question:
        ratio = ratio * w
        # ... (snippet truncated)
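The quote cuts off here. For orientation, the standard PPO clipped-surrogate step that typically follows `ratio` would look something like this (a generic sketch of PPO, with `cliprange` as a hypothetical hyperparameter; not necessarily the repo's exact code):

        # Generic PPO clipped surrogate: take the pessimistic (elementwise max
        # of the negated objectives) of the unclipped and clipped terms, then
        # average over the unmasked response tokens.
        pg_loss1 = -advantages * ratio
        pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
        pg_loss = torch.sum(torch.max(pg_loss1, pg_loss2) * mask) / n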

In https://github.com/microsoft/LMOps/issues/255, it is stated that $w = 1$. Is it also 1 here?

In the paper, this part makes no mention of a $w$ factor:

[Screenshot: the corresponding $(\nabla L)_{\text{long}}$ formula from the paper]
t1101675 commented 3 weeks ago

No. $w$ is absorbed into $\rho_t(\theta)$ in the formula.
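In other words, the implementation's `ratio * w` is itself the $\rho_t(\theta)$ that appears in the paper: for $w > 0$, $w \cdot e^{\log q_\theta - \log q_{\theta_\text{old}}} = e^{\log q_\theta - \log q_{\theta_\text{old}} + \log w}$. A minimal standalone check of that equivalence (made-up tensors, not the repo's trainer):

    import torch

    # Multiplying the ratio by w is the same as absorbing w into rho_t(theta):
    # w * exp(logprobs - old_logprobs) == exp(logprobs - old_logprobs + log(w)).
    logprobs = torch.tensor([[-0.7, -1.4]])      # log q_theta(y_t | y_<t, x)
    old_logprobs = torch.tensor([[-0.9, -0.7]])  # log q_theta_old(y_t | y_<t, x)
    w = torch.tensor([[0.9, 1.0]])               # per-token weight (w > 0)

    ratio_times_w = torch.exp(logprobs - old_logprobs) * w       # as implemented
    rho_absorbed = torch.exp(logprobs - old_logprobs + w.log())  # w inside rho
    assert torch.allclose(ratio_times_w, rho_absorbed)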

lancerts commented 3 weeks ago

cool, thanks