Open mountinyy opened 1 year ago
Hi @mountinyy, Thanks for reaching out! Yep, we would agree with you on the matter. We are going to test out the changes and get back to you!
@PierpaoloSorbellini Thank you for the reply! Please let me know the result :)
Hi @mountinyy I think it's more coherent now, but we are still having some issues with RLHF. We are digging deeper to find the root cause. Thanks for the advice; you should already see the change in #306.
Hello, I have a quick question. I know most RLHF setups use a KL divergence penalty.
https://github.com/nebuly-ai/nebullvm/blob/aad1c09ce20946294df3ec83569bad9496f58d0e/apps/accelerate/chatllama/chatllama/rlhf/trainer.py#L871-L877
However, in the InstructGPT paper (link), the KL term divides (policy for RL) by (policy for SFT), which corresponds to action_log_prob - old_action_log_probs in your code. I have also seen implementations that follow this, such as ColossalAI.
But it seems like you are doing the opposite. I know you negate the term since you are turning the RL reward into a deep learning loss. But even considering that, you are adding the KL value rather than subtracting it, which is still the opposite of action_log_prob - old_action_log_probs.
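For reference, here is a minimal sketch of how I understand the InstructGPT-style KL penalty. The tensor names just mirror the ones in trainer.py for readability, and the kl_coef value is arbitrary; this is my own reading of the formula, not your code.

```python
import torch

def kl_penalized_reward(reward, action_log_prob, old_action_log_probs, kl_coef=0.1):
    """InstructGPT-style reward: r - beta * log(pi_RL / pi_SFT)."""
    # log(pi_RL / pi_SFT) = action_log_prob - old_action_log_probs
    kl = action_log_prob - old_action_log_probs
    # The KL term is subtracted from the environment reward,
    # so drifting away from the SFT policy is penalized.
    return reward - kl_coef * kl


# Dummy per-token log-probabilities just to show the sign of the penalty.
reward = torch.tensor([1.0, 0.5])
action_log_prob = torch.tensor([-1.2, -0.8])       # current RL policy
old_action_log_probs = torch.tensor([-1.0, -1.0])  # frozen SFT policy
print(kl_penalized_reward(reward, action_log_prob, old_action_log_probs))
```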
Do you have any special reason for this?
Thank you for reading :)