lucidrains / PaLM-rlhf-pytorch

Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM
MIT License
7.7k stars 666 forks

Value function #35

Open tonylin52 opened 1 year ago

tonylin52 commented 1 year ago

Hi,

I am confused about the 'value function' in the InstructGPT paper. The paper says: "As previously mentioned, for all PPO models we use a 6B RM and a 6B value function, and the latter is initialized from the former." So the reward model (RM) and the value function appear to be two separate models. However, I cannot find anything, either in the objective function or elsewhere in the paper, showing how the value function is involved in PPO RL training.

Thanks
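
For context on the question, here is a minimal sketch of where a value function usually enters PPO in general: it supplies the baseline for the advantage estimates used in the clipped policy loss, and it is trained with its own regression loss. This is generic PPO with generalized advantage estimation, not code taken from the InstructGPT paper or from this repository. The function names `gae_advantages` and `ppo_losses`, the hyperparameter defaults, and the toy numbers are all assumptions made for illustration.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one finished episode.

    rewards[t] is the reward at step t and values[t] is the value
    function's estimate V(s_t). The value after the last step is
    taken to be 0 (hypothetical helper, for illustration only).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD residual: how much better the step was than V predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_losses(logp_new, logp_old, advantages, values, returns, clip=0.2):
    """Return (policy_loss, value_loss), each averaged over steps."""
    policy_terms, value_terms = [], []
    for lp_new, lp_old, adv, v, ret in zip(
        logp_new, logp_old, advantages, values, returns
    ):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(min(ratio, 1 + clip), 1 - clip)
        # Clipped surrogate objective, negated so that lower is better.
        policy_terms.append(-min(ratio * adv, clipped * adv))
        # The value function is regressed toward the observed returns.
        value_terms.append((v - ret) ** 2)
    n = len(policy_terms)
    return sum(policy_terms) / n, sum(value_terms) / n


# Toy episode: a single reward arrives at the final step.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]  # made-up value-function predictions
advantages = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
returns = [a + v for a, v in zip(advantages, values)]
logp = [-1.0, -1.2, -0.8]  # made-up log-probs; same policy, so ratio is 1
policy_loss, value_loss = ppo_losses(logp, logp, advantages, values, returns)
```

In this sketch the reward signal and the value function play different roles: the rewards say how good the sampled output was, while the value function predicts the expected return so that the policy update is driven by the difference between the two. Whether the paper or this repository follows exactly this formulation is not something the sketch establishes.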