Open gauss-clb opened 10 months ago
Also, I wonder whether we should use the output of the last token instead of the mean over the input sequence, i.e.
value = self.value_head(last_hidden_states)[:, index_of_eos]
(for batch_size > 1, use torch.gather instead) rather than the mean.
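The torch.gather idea above could look like the following minimal sketch. Here `values` stands for the value head's per-position output and `eos_indices` for the per-sequence eos positions; both are hypothetical example tensors, not the repo's actual variables.

```python
import torch

# Hypothetical example: value-head outputs for a batch of 2 sequences,
# where positions after <eos> are padding.
values = torch.tensor([[0.1, 0.2, 0.3, 0.0],
                       [0.4, 0.5, 0.0, 0.0]])
eos_indices = torch.tensor([2, 1])  # position of <eos> in each sequence

# torch.gather picks values[i, eos_indices[i]] for each row i,
# i.e., V(eos_token) per sequence instead of a mean over all positions.
eos_values = torch.gather(values, 1, eos_indices.unsqueeze(1)).squeeze(1)
print(eos_values)  # tensor([0.3000, 0.5000])
```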
Thanks for the suggestion. We are planning to change it in the upcoming PR #4471.
There is another strange problem: in the RL stage, the input to the reward model may look like
<bos_token_id> <question_token_id_1> ... <question_token_id_n> <pad_token_id> ... <pad_token_id> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>
This input form differs from the one used during reward model training. Could this be the cause of the unstable training?
As for this part, I do not quite understand where the difference lies. Could you explain it?
During reward model training, the question and answer (chosen or rejected) sequences are concatenated directly, so there are no pad tokens between question and answer. In the RL stage, however, the actor first reads a batch of questions and then generates answers, so the questions are padded to a common length for batching. With right padding, this leaves pad tokens between the question tokens and the answer tokens, and we know the output of the actor is sensitive to the input, especially to padding tokens between questions and answers.

An alternative solution is to use left padding in both reward model training and RL training. With left padding, the mean reward should exclude the padding before the question tokens. In my opinion, the output of the reward model represents the value function of a state (the current token and everything before it), so taking the mean of V(s) may be unnecessary. We could just take the output at the eos token, V(eos_token), which represents the expected reward for the entire sequence.
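The left-padding point above can be illustrated with a tiny sketch (hypothetical token ids, not the repo's code): with left padding, the last position of every row is the final real token, so indexing values[:, -1] uniformly picks V(eos); with right padding, it lands on a pad token for every sequence shorter than the batch maximum.

```python
import torch

pad, eos = 0, 99  # hypothetical pad / eos token ids

# Right-padded batch: <eos> sits at a different position in each row.
right = torch.tensor([[5, 6, eos, pad, pad],
                      [7, eos, pad, pad, pad]])

# Left-padded batch: the last column is always the final real token.
left = torch.tensor([[pad, pad, 5, 6, eos],
                     [pad, pad, pad, 7, eos]])

assert (left[:, -1] == eos).all()       # [:, -1] is V(eos) for every row
assert not (right[:, -1] == eos).any()  # [:, -1] hits padding instead
```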
I have fixed some bugs, and the mean reward value became more stable. But there is another problem: the mean reward value always oscillates around zero. Do you have any idea why? Could it be related to the loss function used in the reward model training phase (the mean of value_head(last_hidden_states)[:, -1])?
So the question tokens are padded for batching; with right padding, there are pad tokens between the question tokens and the answer tokens. And we know the output of the actor is sensitive to the input, especially the padding tokens between questions and answers.
I think attention_mask can solve this issue. 🤔
In my opinion, the output of the reward model represents the value function of a state (the current token and everything before it), so taking the mean of V(s) may be unnecessary. We could just take the output at the eos token, V(eos_token), which represents the expected reward for the entire sequence.
Yes, this is exactly what we plan to adapt in the next PR :)
I have fixed some bugs, and the mean reward value became more stable. But there is another problem: the mean reward value always oscillates around zero. Do you have any idea why? Could it be related to the loss function used in the reward model training phase (the mean of value_head(last_hidden_states)[:, -1])?
We are planning to reproduce the training procedures of sft, rm, and ppo. However, it may take several days before we can share the training results.
@CWHer https://github.com/hpcaitech/ColossalAI/discussions/4476, could you answer this question? In my opinion, the advantage function is usually estimated with GAE (Generalized Advantage Estimation).
TL;DR: OpenAI DOES NOT release their implementation, and various implementations of PPO exist in different libraries. We are working on supporting another version of reward calculation (i.e., with GAE).
For more details, refer to #3374, #4125, and #4309.
You can also find a prototype of ppo with GAE in this branch.
However, the performance is still unverified, and it may contain bugs.
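For reference, GAE itself can be sketched in a few lines. This is a generic, hedged implementation of the standard recurrence (delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated backwards with decay gamma * lambda), not the prototype branch's actual code; the tensors and default hyperparameters are illustrative.

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one trajectory.

    rewards, values: 1-D tensors of length T, where values[t] = V(s_t);
    the value after the final step is treated as 0 (episode ends there).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        gae = delta + gamma * lam * gae                       # discounted sum
        advantages[t] = gae
    return advantages

# Illustrative trajectory of length 3.
adv = gae_advantages(torch.tensor([1.0, 0.0, 1.0]),
                     torch.tensor([0.5, 0.5, 0.5]))
```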
I just use batch_size=1 and no padding in the reward model training process. In the RL stage, the mean value became more stable at the start of training, which suggests the reward at padding tokens is indeed meaningless.
Before the change, the mean reward looked as follows:
Why does the reward model use mean(values[:, :-1], dim=1) as its output? https://github.com/hpcaitech/ColossalAI/blob/d20dceb9a3d1bdcb2376201220f49fca7c7c1be9/applications/Chat/coati/models/base/reward_model.py#L39
The input may be like
<bos_token_id> <question_token_id_1> ... <question_token_id_n> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>
so I think values should use self.value_head(last_hidden_states)[:, :index_of_eos + 1]. The index -1 may point at a pad token, so that output is meaningless. Also, I wonder whether we should use the output of the last token instead of the mean over the input sequence, i.e.
value = self.value_head(last_hidden_states)[:, index_of_eos]
(for batch_size > 1, use torch.gather instead) rather than the mean.

There is another strange problem: in the RL stage, the input to the reward model may look like
<bos_token_id> <question_token_id_1> ... <question_token_id_n> <pad_token_id> ... <pad_token_id> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>
This input form differs from the one used during reward model training. Could this be the cause of the unstable training?