microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Sequence truncation mistake in step3 training #517

Closed puyuanOT closed 1 year ago

puyuanOT commented 1 year ago

In the _generate_sequence method of ppo_trainer.py, we have:

        with torch.no_grad():
            seq = self.actor_model.module.generate(prompts,
                                                   attention_mask=mask,
                                                   max_length=max_min_length,
                                                   min_length=max_min_length)

        # Filter out seq with no answers (or very short ones). This happens when users
        # directly use the pre-training ckpt without supervised finetuning
        # NOTE: this will cause each GPU to have a different number of examples
        batch_size = seq.shape[0]
        prompt_length = prompts.shape[1]
        ans = seq[:, prompt_length:]

        self.prompt_length = prompt_length
        valid_ans_len = (ans != self.tokenizer.pad_token_id).sum(dim=-1)
        out_seq = []
        for i in range(batch_size):
            if valid_ans_len[i] <= 1:  # drop answers with 1 or fewer non-pad tokens
                continue
            else:
                out_seq.append(seq[i:i + 1])
        out_seq = torch.cat(out_seq, dim=0)  # concatenate outputs along the batch dim

If I understand correctly, we only want to keep the model response, and this is done by truncating each sequence in the seq variable, which has shape (batch_size, seq_len). However, the final loop does out_seq.append(seq[i:i + 1]), which keeps the whole sequence, prompt included.

However, if the intention is instead to keep both the prompt and the response in out_seq, that would contradict step 2 training, where only the responses are passed to the reward model.
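For concreteness, here is a toy sketch (illustrative tensors only, not the actual training code) of the two slicings in question: seq[i:i + 1] keeps prompt + response, while slicing ans would keep only the response.

    import torch

    prompt_length = 3
    # toy batch of 2 sequences: 3 prompt tokens followed by 4 generated tokens (0 = pad)
    seq = torch.tensor([
        [11, 12, 13, 101, 102, 103, 104],
        [21, 22, 23, 201, 202,   0,   0],
    ])
    ans = seq[:, prompt_length:]   # response-only view, shape (2, 4)

    i = 0
    full = seq[i:i + 1]            # what the loop appends: shape (1, 7), prompt + response
    resp = ans[i:i + 1]            # response only: shape (1, 4)
    print(full.shape, resp.shape)  # torch.Size([1, 7]) torch.Size([1, 4])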

xiaoxiawu-microsoft commented 1 year ago

@puyuanOT Thanks for your question. You are right that we keep the whole sequence in the generation code, and you are also right that step 2 reward training uses only the response. But if you look at how we calculate the loss, you will see that we apply it only over the response:

https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py#L169

actor_loss = self.actor_loss_fn(actor_log_prob[:, start:],
                                log_probs[:, start:], advantages,
                                action_mask[:, start:])

where start is the prompt length.
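As a rough sketch of how that slicing plays out (with a stand-in loss function, not the actual actor_loss_fn, and made-up shapes):

    import torch

    def toy_actor_loss_fn(logprobs, old_logprobs, advantages, mask):
        # stand-in for a PPO-style loss: masked mean of ratio * advantage
        ratio = torch.exp(logprobs - old_logprobs)
        return -(ratio * advantages * mask).sum() / mask.sum()

    batch, gen_len, start = 2, 8, 3               # start = prompt length
    actor_log_prob = torch.randn(batch, gen_len)  # per-token log-probs over the full sequence
    log_probs = torch.randn(batch, gen_len)       # old (rollout) log-probs
    advantages = torch.randn(batch, gen_len - start)
    action_mask = torch.ones(batch, gen_len)

    # Only columns from `start` onward (the response tokens) enter the loss,
    # so the prompt portion kept in seq never contributes to the gradient.
    loss = toy_actor_loss_fn(actor_log_prob[:, start:],
                             log_probs[:, start:], advantages,
                             action_mask[:, start:])
    print(loss)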

For the critic model it is handled similarly, here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py#L177

Let me know if you have further questions :)