Open mountinyy opened 1 year ago
Hi @mountinyy, Thanks for reaching out! Yep, we would agree with you on the matter. We are going to test out the changes and get back to you!
@PierpaoloSorbellini Thank you for the reply! Please let me know the result :)
Hi @mountinyy I think it's more coherent now, but we are still having some issues with RLHF. We are digging deeper to find the root cause. Thanks for the advice; you should already see the change in #306.
Hello, I have a quick question. I know most RLHF setups use a KL divergence penalty.
https://github.com/nebuly-ai/nebullvm/blob/aad1c09ce20946294df3ec83569bad9496f58d0e/apps/accelerate/chatllama/chatllama/rlhf/trainer.py#L871-L877
However, in the InstructGPT paper (link), the KL term divides (policy for RL) by (policy for SFT), which corresponds to action_log_prob - old_action_log_probs in your code. I have also seen implementations that follow this, such as ColossalAI.
But it seems like you are doing the opposite. I know you negate the term since you are turning the RL reward into a deep learning loss. But even considering that, you are adding the KL value rather than subtracting it, which is still the opposite of action_log_prob - old_action_log_probs.
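For reference, here is a minimal sketch of how I understand the InstructGPT-style KL penalty. The tensor names just mirror the ones in trainer.py for readability, and the kl_coef value is arbitrary; this is my own reading of the formula, not your code.

```python
import torch

def kl_penalized_reward(reward, action_log_prob, old_action_log_probs, kl_coef=0.1):
    """InstructGPT-style reward: r - beta * log(pi_RL / pi_SFT)."""
    # log(pi_RL / pi_SFT) = action_log_prob - old_action_log_probs
    kl = action_log_prob - old_action_log_probs
    # The KL term is subtracted from the environment reward,
    # so drifting away from the SFT policy is penalized.
    return reward - kl_coef * kl


# Dummy per-token log-probabilities just to show the sign of the penalty.
reward = torch.tensor([1.0, 0.5])
action_log_prob = torch.tensor([-1.2, -0.8])       # current RL policy
old_action_log_probs = torch.tensor([-1.0, -1.0])  # frozen SFT policy
print(kl_penalized_reward(reward, action_log_prob, old_action_log_probs))
```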
Do you have any special reason for this?
Thank you for reading :)