microsoft / DeepSpeedExamples

Example models using DeepSpeed

Rewards in PPO seem to be recomputed many times #528

Open dwyzzy opened 1 year ago

dwyzzy commented 1 year ago

Thank you for the great work! The KL rewards seem to be recomputed every time train_rlhf() is called. [code]

    def train_rlhf(self, inputs):
        # train the rlhf mode here
        ### process the old outputs
        prompts = inputs['prompts']
        log_probs = inputs['logprobs']
        ref_log_probs = inputs['ref_logprobs']
        reward_score = inputs['rewards']
        values = inputs['value']
        attention_mask = inputs['attention_mask']
        seq = inputs['input_ids']

        start = prompts.size()[-1] - 1
        action_mask = attention_mask[:, 1:]

        old_values = values
        with torch.no_grad():
            old_rewards = self.compute_rewards(prompts, log_probs,
                                               ref_log_probs, reward_score,
                                               action_mask)

Both log_probs and ref_log_probs come from the experience buffer, which means old_rewards is always the same for a given episode. Am I missing something?
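
For context, here is a minimal sketch of what compute_rewards does with those buffered tensors (reconstructed from the surrounding code; kl_ctl and clip_reward_value are assumed to be the trainer's KL coefficient and reward-clipping hyperparameters, so details may differ from the repository):

    def compute_rewards(self, prompts, log_probs, ref_log_probs, reward_score,
                        action_mask):
        # Per-token KL penalty between the actor and the frozen reference model.
        rewards = -self.kl_ctl * (log_probs - ref_log_probs)
        # The scalar reward-model score is added only at the last generated token.
        start = prompts.shape[1] - 1
        ends = start + action_mask[:, start:].sum(1) + 1
        reward_clip = torch.clamp(reward_score, -self.clip_reward_value,
                                  self.clip_reward_value)
        for j in range(log_probs.shape[0]):
            rewards[j, start:ends[j]][-1] += reward_clip[j]
        return rewards

Every tensor used here comes from the buffered inputs dict and the two coefficients are constants, so the output is identical on every call within the same episode.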

Remixa-python commented 1 year ago

You are right! In addition to the KL reward, the advantages and returns are also recomputed. In other frameworks such as ColossalChat they are all part of the experience data ("exp_data"), as pictured below.

[Image: ColossalChat exp data]
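
To illustrate that pattern, a hypothetical sketch (the Experience dataclass and make_experience helper below are illustrative, not ColossalChat's or DeepSpeedChat's actual API; get_advantages_and_returns is assumed to be the trainer's GAE helper): the KL rewards, advantages, and returns are computed once, under no_grad, when the rollout is generated, and the PPO inner loop only reads them back.

    from dataclasses import dataclass
    import torch

    @dataclass
    class Experience:
        # Everything the PPO inner loop needs, computed once per rollout.
        input_ids: torch.Tensor
        attention_mask: torch.Tensor
        log_probs: torch.Tensor
        values: torch.Tensor
        rewards: torch.Tensor      # per-token KL penalty + clipped reward score
        advantages: torch.Tensor   # GAE over rewards and values
        returns: torch.Tensor      # advantages + values

    def make_experience(trainer, inputs):
        # Do the no-grad bookkeeping a single time per episode.
        prompts, seq = inputs['prompts'], inputs['input_ids']
        attention_mask = inputs['attention_mask']
        log_probs, ref_log_probs = inputs['logprobs'], inputs['ref_logprobs']
        reward_score, values = inputs['rewards'], inputs['value']
        start = prompts.size(-1) - 1
        action_mask = attention_mask[:, 1:]
        with torch.no_grad():
            rewards = trainer.compute_rewards(prompts, log_probs, ref_log_probs,
                                              reward_score, action_mask)
            advantages, returns = trainer.get_advantages_and_returns(
                values, rewards, start)
        return Experience(seq, attention_mask, log_probs, values,
                          rewards, advantages, returns)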

yaozhewei commented 1 year ago

Hi, thank you for pointing this out. Yes, they are recomputed. For now we are using an on-policy setup, so it is fine; we will try to fix this later.
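
Until such a fix lands, one minimal workaround sketch (assuming the inputs dict for an episode is the same object across the inner PPO epochs, so it can carry a cached entry; 'old_rewards' is a hypothetical key name) is to compute the KL rewards only on the first pass through train_rlhf():

        old_values = values
        # Compute the KL rewards only once per episode; later inner-epoch
        # passes over the same buffered batch reuse the cached tensor.
        if 'old_rewards' not in inputs:
            with torch.no_grad():
                inputs['old_rewards'] = self.compute_rewards(
                    prompts, log_probs, ref_log_probs, reward_score,
                    action_mask)
        old_rewards = inputs['old_rewards']

The same caching could be applied to the advantages and returns, since they are likewise derived only from buffered tensors.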