huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

DPO loss remains 0.6931 and reward is stuck at 0.0 #1627

Closed: virt9 closed this issue 5 months ago

virt9 commented 5 months ago

Hello! I'm opening this issue for the problem of "DPO loss remains 0.6931 from the first step and the rewards are stuck at 0.0". The problem was originally reported in #1311, but I can't find a solution there and no answer figures it out; the solution mentioned in #1311 doesn't work for me. My model is CodeLlama-7B-Python and my trl is the newest version, my learning rate is 1e-6 and my batch size is 1, and I have 40 GB of CUDA memory. Thanks for any idea that may help me!
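For reference, here is a minimal sketch of a setup matching the numbers above (CodeLlama-7B-Python, lr 1e-6, batch size 1). The dataset path is a placeholder, and the exact `DPOTrainer` signature varies across trl versions; this uses the older style where `beta` is passed directly to the trainer:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "codellama/CodeLlama-7b-Python-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_id)

# placeholder: any preference dataset with "prompt"/"chosen"/"rejected" columns
train_dataset = load_dataset("your-org/your-preference-dataset", split="train")

training_args = TrainingArguments(
    output_dir="dpo-codellama",
    learning_rate=1e-6,
    per_device_train_batch_size=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    beta=0.1,  # must be > 0, see the fix further down the thread
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```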

xhwang22 commented 5 months ago

Did you find a solution? I have the same bug. @virt9

virt9 commented 5 months ago

> Did you find a solution? I have the same bug. @virt9

Oh yes, I found that I had set ref_model to None and beta to 0.0. In fact, beta influences the loss and it can't be 0.0, so I set it to 0.1. You can find the details in the loss-computation process in the source code.
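To see why, here is a simplified sketch of the sigmoid DPO loss (following the formula in the DPO paper; the names are illustrative, not trl's exact internals). With `beta = 0.0` the logits are scaled to zero, so the loss is pinned at `-log(sigmoid(0)) = log 2 ≈ 0.6931` and both rewards are exactly 0.0, no matter what the models predict:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta):
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios
    losses = -F.logsigmoid(beta * logits)
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    return losses, chosen_rewards, rejected_rewards

# made-up per-sequence log-probs, just to exercise the formula
logps = [torch.tensor(-10.0), torch.tensor(-20.0),
         torch.tensor(-12.0), torch.tensor(-18.0)]
for beta in (0.0, 0.1):
    losses, cr, rr = dpo_loss(*logps, beta)
    print(f"beta={beta}: loss={losses.item():.4f}, chosen_reward={cr.item():.2f}")
# beta=0.0 -> loss=0.6931 (= log 2) and rewards 0.0, regardless of the log-probs
# beta=0.1 -> loss and rewards respond to the log-prob gap
```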

lixiaochuan2020 commented 2 months ago

Hi! I have the same bug even though I set the beta parameter and ref_model correctly. The loss remains 0.6931, and the rewards (chosen, rejected, margins) are all 0.0. Would you happen to have any other suggestions? @virt9

(screenshot of the training logs attached)
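One generic thing worth checking (not necessarily the fix from #1311): since the reference model is a frozen copy of the initial policy, the loss is expected to start at exactly log 2; if it never moves, the policy may not be receiving gradient updates at all. A hypothetical sanity check, assuming `model` is the policy passed to `DPOTrainer`:

```python
# If nothing is trainable, the policy can never diverge from the reference,
# and the loss stays pinned at log(2) ≈ 0.6931 with zero rewards.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors")
assert trainable, "all parameters are frozen; DPO cannot update the policy"
```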

lixiaochuan2020 commented 2 months ago

Solved it. Take a look at the solution in #1311.