hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

Question about the output of reward model in RLHF? #4475

Open gauss-clb opened 10 months ago

gauss-clb commented 10 months ago

Why does the reward model use `mean(values[:, :-1], dim=1)` as its output?

values = self.value_head(last_hidden_states)[:, :-1]
value = values.mean(dim=1).squeeze(1)    # ensure shape is (B)

https://github.com/hpcaitech/ColossalAI/blob/d20dceb9a3d1bdcb2376201220f49fca7c7c1be9/applications/Chat/coati/models/base/reward_model.py#L39

The input may look like `<bos_token_id> <question_token_id_1> ... <question_token_id_n> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>`, so I think the values should be `self.value_head(last_hidden_states)[:, :index_of_eos + 1]`.
Index -1 may correspond to a pad token, so its output is meaningless.

Also, why not use the output at the last (eos) token instead of the mean over the sequence? I.e. `value = self.value_head(last_hidden_states)[:, index_of_eos]` (for batch_size > 1, use `torch.gather` instead) rather than

# for batch_size=1
values = self.value_head(last_hidden_states)[:, :index_of_eos + 1]
value = values.mean(dim=1).squeeze(1)    # ensure shape is (B)
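
For batch_size > 1, here is a minimal sketch of the `torch.gather` variant. `index_of_eos` is a hypothetical per-sample tensor of eos positions, and the standalone function below is only an illustration, not the repository's actual method:

```python
import torch


def eos_value(values: torch.Tensor, index_of_eos: torch.Tensor) -> torch.Tensor:
    """Pick the value predicted at each sequence's <eos> position.

    values:       (B, L) output of the value head, squeezed on the last dim
    index_of_eos: (B,)   position of <eos_token_id> in each sequence
    """
    return torch.gather(values, dim=1, index=index_of_eos.unsqueeze(1)).squeeze(1)


# toy check with batch_size = 2, seq_len = 5
values = torch.tensor([[0.1, 0.2, 0.3, 0.4, 0.5],
                       [1.0, 2.0, 3.0, 4.0, 5.0]])
index_of_eos = torch.tensor([3, 2])       # eos sits at a different position per sample
print(eos_value(values, index_of_eos))    # tensor([0.4000, 3.0000])
```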

There is another strange problem: in the RL stage, the input to the reward model may look like `<bos_token_id> <question_token_id_1> ... <question_token_id_n> <pad_token_id> ... <pad_token_id> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>`

This input format differs from the one used during reward model training; could this be the cause of unstable training?

CWHer commented 10 months ago

> Also, why not use the output at the last (eos) token instead of the mean over the sequence? I.e. `value = self.value_head(last_hidden_states)[:, index_of_eos]` (for batch_size > 1, use `torch.gather` instead) rather than

Thanks for the suggestion. We are planning to change it in the upcoming PR #4471.

> There is another strange problem: in the RL stage, the input to the reward model may look like `<bos_token_id> <question_token_id_1> ... <question_token_id_n> <pad_token_id> ... <pad_token_id> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>`
>
> This input format differs from the one used during reward model training; could this be the cause of unstable training?

As for this part, I do not quite understand where the difference lies. Could you explain it?

gauss-clb commented 10 months ago

During reward model training, the question and answer (rejected or chosen) sequences are concatenated directly; in other words, there is no pad token between the question and the answer. In the RL stage, however, the actor first reads a batch of questions and then generates answers. The question tokens are padded to a common length for batching, so with right padding there are padding tokens between the question tokens and the answer tokens. The output of the actor is sensitive to its input, especially to padding tokens between the question and the answer.

An alternative solution is to use left padding in both reward model training and RL training (see the sketch below); with left padding, the mean of the reward should exclude the padding tokens before the question. In my opinion, the output of the reward model represents the value function of a state (the current token and everything before it), so taking the mean of V(s) may be unnecessary. We could just use the output at the eos token, V(eos_token), which represents the expected reward for the entire sequence.
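
A minimal sketch of the left-padding alternative, assuming a Hugging Face style tokenizer (the model name and prompts below are only placeholders):

```python
from transformers import AutoTokenizer

# "gpt2" is only a placeholder; any causal-LM tokenizer behaves the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # pad on the left of the prompt

prompts = ["What is RLHF?", "Explain PPO in one sentence."]
batch = tokenizer(prompts, padding=True, return_tensors="pt")

# generation appends answer tokens directly after the question, so no pad
# tokens sit between the question and the answer
print(batch["input_ids"].shape)
print(batch["attention_mask"])
```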

I have fixed some bugs, and the mean reward value has become more stable. But there is another problem: the mean reward value always oscillates around zero. Do you have any idea why? Could it be related to the loss function used in the reward model training phase (the mean of `value_head(last_hidden_states)[:, -1]`)?
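
(For reference, a typical pairwise ranking loss for reward models only constrains the difference between the chosen and rejected rewards, not their absolute scale, so rewards hovering around zero would not be surprising. A minimal sketch, assuming that kind of loss is used:)

```python
import torch
import torch.nn.functional as F


def pairwise_rm_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(r_chosen - r_rejected)): only the margin matters,
    # so the absolute reward values are free to drift around zero
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


chosen = torch.tensor([0.3, 1.2])
rejected = torch.tensor([-0.1, 0.7])
print(pairwise_rm_loss(chosen, rejected))
print(pairwise_rm_loss(chosen + 5.0, rejected + 5.0))   # same loss after a constant shift
```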

CWHer commented 10 months ago

> The question tokens are padded to a common length for batching, so with right padding there are padding tokens between the question tokens and the answer tokens. The output of the actor is sensitive to its input, especially to padding tokens between the question and the answer.

I think attention_mask can solve this issue. 🤔
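
A minimal sketch of what I have in mind, assuming `pad_token_id = 0` and an HF-style forward signature (both are illustrative assumptions):

```python
import torch

pad_token_id = 0   # assumption for illustration
input_ids = torch.tensor([[101,  57,  42,   9, 102,   0,   0,   0],
                          [101,   7, 102,   0,   0,   0,   0,   0]])

# positions holding real tokens get 1, padded positions get 0
attention_mask = (input_ids != pad_token_id).long()

# an HF-style model call can then ignore the padded positions entirely:
# outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(attention_mask)
```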

> In my opinion, the output of the reward model represents the value function of a state (the current token and everything before it), so taking the mean of V(s) may be unnecessary. We could just use the output at the eos token, V(eos_token), which represents the expected reward for the entire sequence.

Yes, this is exactly what we plan to adapt in the next PR :)

> I have fixed some bugs, and the mean reward value has become more stable. But there is another problem: the mean reward value always oscillates around zero. Do you have any idea why? Could it be related to the loss function used in the reward model training phase (the mean of `value_head(last_hidden_states)[:, -1]`)?

We are planning to reproduce the training procedure of SFT, RM, and PPO. However, it may take several days before we can share the training results.

gauss-clb commented 10 months ago

@CWHer https://github.com/hpcaitech/ColossalAI/discussions/4476, could you answer this question? In my opinion, the advantage function is usually estimated with GAE (Generalized Advantage Estimation).
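
For reference, a minimal sketch of GAE over a single trajectory (the function and hyperparameters are illustrative, not ColossalAI's implementation):

```python
import torch


def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """GAE over one trajectory.

    rewards: (T,)     per-step rewards
    values:  (T + 1,) value estimates, the last entry being the bootstrap value
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(rewards.shape[0])):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae                          # recursive accumulation
        advantages[t] = gae
    return advantages


rewards = torch.tensor([0.0, 0.0, 1.0])
values = torch.tensor([0.1, 0.2, 0.5, 0.0])   # last entry is the bootstrap value
print(gae_advantages(rewards, values))
```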

CWHer commented 10 months ago

TL;DR: OpenAI DOES NOT release their implementation, and various PPO implementations exist across different libraries. We are working on supporting another version of the reward calculation (i.e., with GAE).

For more details, refer to #3374, #4125, and #4309.

CWHer commented 10 months ago

You can also find a prototype of PPO with GAE in this branch.

However, the performance is still unverified, and it may contain bugs.

gauss-clb commented 10 months ago

I just use batch_size=1 and don't use padding in the reward model training process. In the RL stage, the mean value became more stable at the start of training, which suggests that the reward at padding tokens is meaningless. (screenshot attached)

Before the change, the mean reward looked as follows: (screenshot attached)