lucidrains / PaLM-rlhf-pytorch

Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM

The loss function of the reward model #22

Open · huzechuan opened this issue 1 year ago

huzechuan commented 1 year ago

Hi, I am confused: the loss function of ChatGPT's reward model takes the difference between the rewards of two responses and passes it through a sigmoid. However, the loss function in this repo takes only one response as input and uses the ranking score as a label to compute a cross-entropy loss. Is there an advantage to this?
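
For reference, this is the pairwise loss I mean (a minimal sketch, assuming a `reward_model` that returns one scalar reward per sequence; the names here are placeholders, not this repo's API):

```python
import torch.nn.functional as F

# pairwise ranking loss from the InstructGPT-style setup:
#   loss = -log(sigmoid(r(chosen) - r(rejected)))
# `reward_model` is any model mapping a token sequence to a scalar reward

def pairwise_reward_loss(reward_model, chosen_seq, rejected_seq):
    chosen_reward = reward_model(chosen_seq)      # (batch,)
    rejected_reward = reward_model(rejected_seq)  # (batch,)
    # -log sigmoid(x) == softplus(-x), which is more numerically stable
    return F.softplus(-(chosen_reward - rejected_reward)).mean()
```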

lucidrains commented 1 year ago

@huzechuan i have to admit i haven't totally digested the way they derive their reward values for training

but at the moment, even if their reward is derived from a collection of sampled responses, this repository doesn't lock you into any one method, as you can do your second step (training the reward model) from any <sequence, reward value> pair, however you define it
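
rough sketch of what i mean, with a toy scorer standing in for the reward model in this repo (not the actual class here) - the point is just that the second step only needs <sequence, reward value> pairs, however the reward was derived

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy reward scorer - a stand-in for the transformer-based reward model, not this repo's actual class

class ToyRewardScorer(nn.Module):
    def __init__(self, num_tokens = 20000, dim = 512, num_binned_output = 5):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, dim)
        self.to_bins = nn.Linear(dim, num_binned_output)

    def forward(self, seq):
        x = self.embed(seq).mean(dim = 1)  # mean pooling stands in for the transformer
        return self.to_bins(x)             # logits over binned reward values

scorer = ToyRewardScorer()
opt = torch.optim.Adam(scorer.parameters(), lr = 3e-4)

# <sequence, reward value> pairs, with the reward binned into 5 ratings
seq = torch.randint(0, 20000, (8, 1024))
rating = torch.randint(0, 5, (8,))

loss = F.cross_entropy(scorer(seq), rating)  # cross entropy against the binned reward label
loss.backward()
opt.step()
```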

i guess i'll have to worry about this once i build out the application for sampling from some version of the model and collecting the ratings, so do let me know in detail the optimal approach they discovered. i just think there are other applications beyond text that this could be used for (rl, protein design) that do not necessarily need this sigmoid-of-difference approach

yangjianxin1 commented 1 year ago

> Hi, I am confused: the loss function of ChatGPT's reward model takes the difference between the rewards of two responses and passes it through a sigmoid. However, the loss function in this repo takes only one response as input and uses the ranking score as a label to compute a cross-entropy loss. Is there an advantage to this?

I have the same confusion