Hi @freQuensy23-coder
Indeed, reward scaling plays an important role, and even when the KL term is scaled properly, the user-defined reward should ideally also be scaled appropriately.
In any case, the KL \beta parameter is already implemented as `init_kl_coef` and can be adjusted via the config: https://github.com/huggingface/trl/blob/3ef21a24e7df53d0e0e6fe26b448b94ed3ec7cda/trl/trainer/ppo_trainer.py#L291-L294
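As a minimal sketch of how that could be adjusted (assuming a trl version whose `PPOConfig` exposes `init_kl_coef` and `adap_kl_ctrl`; exact field names and defaults may differ across releases):

```python
from trl import PPOConfig

# Sketch: lower the KL coefficient when user-defined rewards are small in magnitude,
# so the KL penalty does not dominate the total reward.
config = PPOConfig(
    model_name="gpt2",        # placeholder model name for illustration
    learning_rate=1.41e-5,
    init_kl_coef=0.05,        # default is 0.2; tune to match your reward model's scale
    adap_kl_ctrl=True,        # adaptive KL controller targets a KL value instead of a fixed coefficient
)
```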
Hope this helps!
Closing for now! Let us know if you have more questions.
In the Proximal Policy Optimization (PPO) trainer implementation (specifically in this file), the total reward is computed as a weighted sum of the rewards produced by the user-defined reward model and the KL divergence between the current policy and a reference model. However, these two components may operate on vastly different scales: the user-defined rewards could take on very small values (e.g. 0.0001) or very large values, whereas the KL divergence term will likely fall in a fairly standard range.
This scaling mismatch presents a challenge when training a performant policy. If the user-defined rewards are too small relative to the KL term, the policy may not learn well because it is dominated by the KL penalty. Conversely, very large user-defined rewards could dominate the total reward, causing the policy to diverge excessively from the reference model without being sufficiently constrained by the KL term.
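As a toy numeric illustration (not code from trl; the plus-sign combination below just mirrors the `reward = scores + \beta * KL` form discussed here, with the KL term acting as a negative penalty):

```python
# Toy numbers: how the KL penalty can swamp, or be swamped by, the user-defined score.
beta = 0.2                  # KL coefficient (init_kl_coef)
kl_penalty = -beta * 5.0    # e.g. a KL of ~5 nats contributes -1.0 to the total reward

for score in (0.0001, 1.0, 100.0):
    total = score + kl_penalty
    print(f"score={score:>8}: total reward = {total:+.4f}")

# score=  0.0001: total reward = -0.9999   -> KL penalty dominates, policy barely moves
# score=     1.0: total reward = +0.0000   -> roughly balanced
# score=   100.0: total reward = +99.0000  -> score dominates, KL constraint is negligible
```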
Potential solutions could include:

- Introducing a user-configurable \beta coefficient to weight the KL term in the total reward (i.e. reward = scores + \beta * KL). This would give users control to tune the balance based on their reward model's scale.
- Noting this scaling issue and the importance of \beta tuning in the documentation.

Overall, this scaling mismatch merits consideration, as it could lead to unexpected behaviors during PPO policy training in trl. Matching the scales of the different reward components would improve training stability and user experience.
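One possible mitigation on the user side is to rescale the scores before passing them to the trainer. A rough sketch below, where the whitening step is an illustrative choice rather than something trl prescribes, and `ppo_trainer`, `reward_model`, `query_tensors`, and `response_tensors` are assumed to come from a standard trl PPO loop:

```python
import torch

def scale_scores(scores: list[torch.Tensor], target_std: float = 1.0) -> list[torch.Tensor]:
    """Whiten a batch of scalar rewards so their scale is comparable to the KL penalty."""
    stacked = torch.stack(scores)
    scaled = (stacked - stacked.mean()) / (stacked.std() + 1e-8) * target_std
    return list(scaled)

# Usage inside a typical PPO loop (hypothetical variables):
# scores = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]
# stats = ppo_trainer.step(query_tensors, response_tensors, scale_scores(scores))
```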