Hi @freQuensy23-coder
Indeed, reward scaling plays an important role, and even when the KL term is scaled properly, the user-defined reward should ideally also be scaled appropriately.
In any case, the KL \beta parameter is already implemented as `init_kl_coef` and can be adjusted via the config: https://github.com/huggingface/trl/blob/3ef21a24e7df53d0e0e6fe26b448b94ed3ec7cda/trl/trainer/ppo_trainer.py#L291-L294
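As a minimal sketch of how that could be adjusted (assuming a trl version whose `PPOConfig` exposes `init_kl_coef` and `adap_kl_ctrl`; exact field names and defaults may differ across releases):

```python
from trl import PPOConfig

# Sketch: lower the KL coefficient when user-defined rewards are small in magnitude,
# so the KL penalty does not dominate the total reward.
config = PPOConfig(
    model_name="gpt2",        # placeholder model name for illustration
    learning_rate=1.41e-5,
    init_kl_coef=0.05,        # default is 0.2; tune to match your reward model's scale
    adap_kl_ctrl=True,        # adaptive KL controller targets a KL value instead of a fixed coefficient
)
```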
Hope this helps!
Closing for now! Let us know if you have more questions.
In the Proximal Policy Optimization (PPO) trainer implementation (specifically in this file), the total reward is computed as a weighted sum of the rewards produced by the user-defined reward model and the KL divergence between the current policy and a reference model. However, these two components may operate on vastly different scales: the user-defined rewards could take on very small values (e.g. 0.0001) or very large values, whereas the KL divergence term will likely fall in a fairly standard range.
This scaling mismatch presents a challenge when training a performant policy. If the user-defined rewards are too small relative to the KL term, the policy may not learn well because it is dominated by the KL penalty. Conversely, very large user-defined rewards could dominate the total reward, causing the policy to diverge excessively from the reference model without being sufficiently constrained by the KL term.
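As a toy numeric illustration (not code from trl; the plus-sign combination below just mirrors the `reward = scores + \beta * KL` form discussed here, with the KL term acting as a negative penalty):

```python
# Toy numbers: how the KL penalty can swamp, or be swamped by, the user-defined score.
beta = 0.2                  # KL coefficient (init_kl_coef)
kl_penalty = -beta * 5.0    # e.g. a KL of ~5 nats contributes -1.0 to the total reward

for score in (0.0001, 1.0, 100.0):
    total = score + kl_penalty
    print(f"score={score:>8}: total reward = {total:+.4f}")

# score=  0.0001: total reward = -0.9999   -> KL penalty dominates, policy barely moves
# score=     1.0: total reward = +0.0000   -> roughly balanced
# score=   100.0: total reward = +99.0000  -> score dominates, KL constraint is negligible
```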
Potential solutions could include:

- Introducing a user-configurable \beta coefficient to weight the KL term in the total reward (i.e. reward = scores + \beta * KL). This would give users control to tune the balance based on their reward model's scale.
- Noting this scaling issue and the importance of \beta tuning in the documentation.

Overall, this scaling mismatch merits consideration, as it could lead to unexpected behaviors during PPO policy training in trl. Matching the scales of the different reward components would improve training stability and user experience.
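One possible mitigation on the user side is to rescale the scores before passing them to the trainer. A rough sketch below, where the whitening step is an illustrative choice rather than something trl prescribes, and `ppo_trainer`, `reward_model`, `query_tensors`, and `response_tensors` are assumed to come from a standard trl PPO loop:

```python
import torch

def scale_scores(scores: list[torch.Tensor], target_std: float = 1.0) -> list[torch.Tensor]:
    """Whiten a batch of scalar rewards so their scale is comparable to the KL penalty."""
    stacked = torch.stack(scores)
    scaled = (stacked - stacked.mean()) / (stacked.std() + 1e-8) * target_std
    return list(scaled)

# Usage inside a typical PPO loop (hypothetical variables):
# scores = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]
# stats = ppo_trainer.step(query_tensors, response_tensors, scale_scores(scores))
```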