lucidrains / PaLM-rlhf-pytorch

Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM
MIT License

Should critic's input be prompt only? #57

Open ginward opened 11 months ago

ginward commented 11 months ago

In the PPO implementation, the critic appears to take both the prompt and the generated actions as input (or, if pooled is true, only the generated actions). But if we view the prompt as S_t and the prompt plus the generated action as S_{t+T}, shouldn't the value function be V(S_t) rather than V(S_{t+T})?

In other words, when computing the advantage, shouldn't the value function be the expected reward conditioned on the prompt alone?
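A minimal sketch of the alternative the question proposes, written in PyTorch. All names here (`PromptOnlyCritic`, the pooling scheme, the dimensions) are illustrative assumptions, not the repo's actual API; the point is simply that the value head sees only the prompt tokens, so its output plays the role of V(S_t), and the advantage is the reward minus that prompt-only baseline.

```python
import torch
import torch.nn as nn

class PromptOnlyCritic(nn.Module):
    """Toy critic that scores only the prompt, i.e. estimates V(S_t).
    Hypothetical sketch -- not the architecture used in PaLM-rlhf-pytorch."""

    def __init__(self, dim: int, vocab: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.value_head = nn.Linear(dim, 1)

    def forward(self, prompt_ids: torch.Tensor) -> torch.Tensor:
        # mean-pool the prompt embeddings, then project to a scalar value
        pooled = self.embed(prompt_ids).mean(dim=1)   # (batch, dim)
        return self.value_head(pooled).squeeze(-1)    # V(S_t), shape (batch,)

# Advantage as the comment describes it: reward minus a prompt-only baseline.
critic = PromptOnlyCritic(dim=16, vocab=100)
prompt = torch.randint(0, 100, (2, 8))   # (batch, prompt_len); action tokens excluded
reward = torch.tensor([1.0, 0.5])        # one scalar reward per generated sequence
advantage = reward - critic(prompt)      # A = R - V(S_t)
```

Under this reading, sequences generated from the same prompt share one baseline, so the advantage measures how much better a particular generation is than the prompt's average, rather than valuing the already-completed prompt-plus-action state.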