huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Reference model alignment with the current policy #1112

Closed sajastu closed 7 months ago

sajastu commented 8 months ago

Hello,

I've been exploring the implementation of Proximal Policy Optimization (PPO) in the ppo_trainer.py file, and I have a query regarding the handling of the reference model (old policy) in relation to the current policy model.

In standard PPO implementations, it is common practice to periodically update the reference model so that it matches the current state of the policy model. This ensures that the policy's deviations are measured against a relatively recent version of itself, which helps stabilize training and prevents the policy from diverging too quickly. However, reviewing the PPOTrainer class in this codebase, I couldn't locate the place where this periodic alignment of the old policy (the reference model) with the current policy (the model itself) is performed.
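
For concreteness, here is the pattern I have in mind, in toy PyTorch form (purely illustrative: the model, the objective, and the `sync_every` interval are made up and not taken from TRL):

```python
import copy

import torch
from torch import nn

# Toy stand-ins: in the TRL setting these would be the causal LM policy
# and the frozen reference copy created from it.
policy = nn.Linear(8, 8)
ref_policy = copy.deepcopy(policy).requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
sync_every = 10  # hypothetical: refresh the reference every N optimizer steps

for step in range(100):
    x = torch.randn(4, 8)
    loss = (policy(x) - x).pow(2).mean()  # stand-in objective, not a PPO loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (step + 1) % sync_every == 0:
        # Periodic alignment: copy the current policy weights into the reference.
        ref_policy.load_state_dict(policy.state_dict())
```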

I'm wondering if this might be an intentional design choice in this particular implementation, or if there's a possibility that I might have overlooked some aspect of the process. Could it be that the updating of the reference model is handled implicitly in another part of the codebase, or is it the case that the reference model remains static throughout the training process in this implementation?

Thanks,

lvwerra commented 8 months ago

I believe when fine-tuning pretrained LLMs the reference model is already very strong, and the policy model should never deviate too far from it. This is very different from pure RL, and I think that's the reason/intuition why the reference model is usually not updated in this setting.

sajastu commented 8 months ago

@lvwerra thanks so much for your insights; it does make sense in the case of LLMs. It raises a compelling follow-up question for me, though: what about scenarios involving weaker, smaller-scale language models, say those with only a few hundred million pretrained parameters?

If we can verify that this smaller LM is becoming competent on the downstream task (for example, through validation metrics), would it be a good idea to periodically update/align the reference model with the current policy?
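
Concretely, I'm imagining something along these lines inside the PPO training loop (a rough sketch rather than working TRL code: `sync_every` and `compute_rewards` are hypothetical, `generation_kwargs` is assumed to be defined elsewhere, and I'm assuming a single-process setup where `model` and `ref_model` share the same architecture so the state dicts match):

```python
sync_every = 20  # hypothetical: refresh the reference model every N PPO steps

for step, batch in enumerate(ppo_trainer.dataloader):
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, **generation_kwargs
    )
    rewards = compute_rewards(query_tensors, response_tensors)  # user-defined reward fn
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

    if (step + 1) % sync_every == 0:
        # Copy the current policy weights into the otherwise-frozen reference model.
        ppo_trainer.ref_model.load_state_dict(ppo_trainer.model.state_dict())
```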

Thanks in advance,

lvwerra commented 8 months ago

I think the main reason is that one doesn't really expect the model (even a small one) to get better at language modeling than it was after pretraining; the PPO step is generally just there to align it better with some preference. If you think the reference model is holding the policy back, you can always weaken the KL term, which will decrease the coupling between the reference and active models.
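
For example, something like this (values are only illustrative, and I'm assuming the `init_kl_coef`/`adap_kl_ctrl` arguments of `PPOConfig`):

```python
from trl import PPOConfig

# Illustrative values only: a smaller init_kl_coef weakens the KL penalty,
# which loosens the coupling between the active and reference models.
config = PPOConfig(
    model_name="gpt2",       # placeholder policy
    learning_rate=1.41e-5,
    init_kl_coef=0.05,       # smaller coefficient -> weaker KL penalty
    adap_kl_ctrl=False,      # keep the coefficient fixed instead of adapting it
)
```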

Of course you are free to test it out yourself, maybe this is all wrong :)

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.