Dear authors,
Many thanks for creating this wonderful repo! The blog post that accompanies it includes a remark about "produced rewards and values of shape (B, T, 1)". However, as I recall, PPO in RLHF uses only the final reward. Is there a contradiction here? Could you kindly point me to the line of code that would resolve this question? Thanks!