vwxyzjn / lm-human-preference-details

RLHF implementation details of OAI's 2019 codebase
MIT License

Reward Shape #28

Closed · QiyaoWei closed this issue 7 months ago

QiyaoWei commented 7 months ago

Dear authors,

Many thanks for creating this wonderful repo! The blog post that accompanies this repo remarks that the model "produced rewards and values of shape (B, T, 1)". However, I recall that PPO in RLHF only uses the reward at the final token. Is there a contradiction here? Could you kindly point out which line of code resolves this? Thanks!

vwxyzjn commented 7 months ago

We do take the reward from the last token. See:
https://github.com/vwxyzjn/lm-human-preference-details/blob/ccc19538e817e98a60d3253242ac15e2a562cb49/lm_human_preference_details/train_reward_accelerate.py#L302
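For anyone landing here later, a minimal sketch of the idea (illustrative only, not the repo's exact code; the tensor and variable names here are assumptions): the model produces a reward for every token, giving a tensor of shape (B, T, 1), but only the entry at the last real (non-padding) token of each sequence is kept as the scalar reward.

```python
import torch

B, T = 4, 16
rewards = torch.randn(B, T, 1)                       # per-token rewards, shape (B, T, 1)
attention_mask = torch.ones(B, T, dtype=torch.long)  # 1 = real token, 0 = padding

# Index of the last real token in each sequence.
last_index = attention_mask.sum(dim=1) - 1           # shape (B,)

# Keep only the reward at that position for each sequence in the batch.
final_reward = rewards[torch.arange(B), last_index].squeeze(-1)  # shape (B,)
```

So there is no contradiction: the (B, T, 1) shape describes what the network outputs, while the scalar reward PPO consumes is indexed out of it at the final token.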