huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Scaling mismatch between model rewards and KL regularization in PPO trainer #860

Closed freQuensy23-coder closed 1 year ago

freQuensy23-coder commented 1 year ago

In the Proximal Policy Optimization (PPO) trainer implementation (specifically in this file), the total reward combines the score produced by the user-defined reward model with a KL-divergence penalty between the current policy and a reference model. However, the reward-model scores and the KL term may operate on vastly different scales: the user-defined rewards could take very small values (e.g. 0.0001) or very large ones, whereas the KL divergence term typically falls within a fairly narrow range.
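
For context, here is a minimal sketch of how such a per-token reward could be assembled (the function and variable names below are illustrative, not the exact identifiers used in `ppo_trainer.py`):

```python
import torch

def combine_rewards(scores, logprobs, ref_logprobs, masks, kl_coef=0.2):
    """Sketch: per-token reward = -kl_coef * KL estimate, with the scalar
    reward-model score added once on the last generated token.

    scores:       (batch,) reward-model scores, one per sequence
    logprobs:     (batch, seq_len) policy log-probs of the generated tokens
    ref_logprobs: (batch, seq_len) reference-model log-probs of the same tokens
    masks:        (batch, seq_len) 1 for response tokens, 0 otherwise
    """
    kl = logprobs - ref_logprobs          # per-token KL estimate
    rewards = -kl_coef * kl               # KL penalty applied at every token
    for i in range(scores.shape[0]):
        last = masks[i].nonzero()[-1].item()  # index of the last response token
        rewards[i, last] += scores[i]         # reward-model score added once
    return rewards
```

If `scores` is on the order of 0.0001 while the KL penalty is on the order of 0.1, the sum is dominated by the KL term, which is exactly the mismatch described above.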

This scaling mismatch makes it hard to train a performant policy. If the user-defined rewards are too small relative to the KL term, the policy may learn little because it is dominated by the KL penalty. Conversely, very large user-defined rewards can dominate the objective, letting the policy diverge far from the reference model without being sufficiently constrained by the KL term.

Potential solutions could include normalizing or rescaling the user-defined rewards before they are combined with the KL penalty, or exposing a tunable coefficient on the KL term so the two contributions can be balanced (a sketch of the former follows below).
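
As an illustration of the first option, here is a hedged sketch of running-mean/std normalization of the reward-model scores before they enter the PPO objective. This is not the code path in `ppo_trainer.py`, just an assumption about what such scaling could look like:

```python
import torch

class RunningRewardScaler:
    """Keeps running mean/variance of reward-model scores (Welford-style
    batch updates) and rescales them to roughly unit scale, so they are
    comparable in magnitude to the KL penalty."""

    def __init__(self, eps: float = 1e-8):
        self.mean = 0.0
        self.var = 1.0
        self.count = eps

    def update(self, scores: torch.Tensor) -> None:
        batch_mean = scores.mean().item()
        batch_var = scores.var(unbiased=False).item()
        batch_count = scores.numel()
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Standard parallel-update formulas for mean and variance
        self.mean += delta * batch_count / total
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta ** 2 * self.count * batch_count / total) / total
        self.count = total

    def __call__(self, scores: torch.Tensor) -> torch.Tensor:
        self.update(scores)
        return (scores - self.mean) / (self.var ** 0.5 + 1e-8)
```

With something like this, scores of 0.0001 or 1000 alike would be brought to roughly unit scale before being weighed against the KL penalty.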

lvwerra commented 1 year ago

Hi @freQuensy23-coder

Indeed, reward scaling plays an important role: even when the KL term is scaled properly, the user-defined reward should ideally be scaled appropriately as well.

In any case, the KL \beta coefficient is already implemented: it is called `init_kl_coef` and can be adjusted via the config: https://github.com/huggingface/trl/blob/3ef21a24e7df53d0e0e6fe26b448b94ed3ec7cda/trl/trainer/ppo_trainer.py#L291-L294
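
For instance, assuming the `PPOConfig` fields of the TRL version current at the time (`init_kl_coef`, `adap_kl_ctrl`, `target`), adjusting the coefficient could look roughly like this:

```python
from trl import PPOConfig

# Sketch: lower init_kl_coef if the reward-model scores are small relative to
# the KL penalty; adap_kl_ctrl/target let the trainer adapt the coefficient
# toward a target KL during training. Model name and values are placeholders.
config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    init_kl_coef=0.05,   # smaller beta -> weaker KL regularization
    adap_kl_ctrl=True,   # use the adaptive KL controller
    target=6.0,          # target KL for the adaptive controller
)
```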

Hope this helps!

younesbelkada commented 1 year ago

Closing for now! Let us know if you have more questions.