RylanSchaeffer opened this issue 2 months ago
I opened discussions on the cleanrl models but haven't heard back: https://huggingface.co/cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr/discussions/1
I just discovered that the default RM has neither a padding token nor a chat template. This is inconsistent with the corresponding default SFT model, which also lacks a chat template. This makes me think that the reward model was trained differently from its SFT counterpart, and that in the PPOv2Trainer example the SFT model is used with a chat template it wasn't trained on.
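For reference, this is straightforward to check directly. A minimal sketch using `transformers` (it just loads each tokenizer from the Hub and prints the relevant attributes):

```python
from transformers import AutoTokenizer

# Inspect the padding token and chat template of both default checkpoints.
for name in [
    "cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr",
    "cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr",
]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name)
    print("  pad_token:    ", tok.pad_token)      # None if no padding token is set
    print("  chat_template:", tok.chat_template)  # None if no chat template is set
```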
I really think we need a demonstration of how to create SFT models and reward models for use with PPOv2Trainer.
cc: @qgallouedec
Feature request
Please provide example scripts in https://github.com/huggingface/trl/tree/main/examples/scripts/ppo showing how to create corresponding SFT and RM checkpoints to use for PPO.
Motivation
TRL helpfully provides an example script for `PPOv2Trainer`. However, it does not provide an example script showing how the SFT checkpoint (`cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr`) and reward model checkpoint (`cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr`) were created.

I tried creating my own SFT and RM checkpoints using `SFTTrainer` and `RewardTrainer`, and then ran PPO using `PPOv2Trainer` (see the sketch below). The SFT and RM training runs looked reasonable (stable, with good test losses and good test accuracy for the RM), but the PPO runs were unstable and performed poorly. I now cannot tell whether my SFT model is bad, whether my reward model is bad, or whether my PPO hyperparameters are bad.

It would be amazing if scripts and/or commands were included demonstrating how to properly create SFT and RM checkpoints to subsequently use in PPO, RLOO, and other methods.
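For concreteness, here is a condensed sketch of the two training stages I attempted. The dataset names (`trl-lib/tldr`, `trl-lib/tldr-preference`) and output paths are stand-ins rather than my exact setup, and argument names vary across TRL versions (older `RewardTrainer` versions, for instance, expect pre-tokenized `chosen`/`rejected` pairs), so treat this as an outline:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RewardConfig, RewardTrainer, SFTConfig, SFTTrainer

BASE = "EleutherAI/pythia-1b-deduped"
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Pythia ships without a padding token

# --- Stage 1: supervised fine-tuning on TL;DR ---
sft_trainer = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(BASE),
    args=SFTConfig(output_dir="pythia-1b-deduped-sft-tldr"),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
    tokenizer=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model()

# --- Stage 2: reward model on preference pairs, initialized from the SFT model ---
rm = AutoModelForSequenceClassification.from_pretrained(
    "pythia-1b-deduped-sft-tldr", num_labels=1  # scalar reward head
)
rm.config.pad_token_id = tokenizer.pad_token_id  # the RM needs a pad token to batch
rm_trainer = RewardTrainer(
    model=rm,
    args=RewardConfig(output_dir="pythia-1b-deduped-rm-tldr"),
    train_dataset=load_dataset("trl-lib/tldr-preference", split="train"),
    tokenizer=tokenizer,
)
rm_trainer.train()
rm_trainer.save_model()
```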
Edit: I know examples exist for SFT and RM training. I want examples that specifically demonstrate how to create SFT and RM checkpoints for PPO, because my attempts thus far have not been successful; the PPO wiring I'm using is sketched below.
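Here is roughly how I then wire those checkpoints into `PPOv2Trainer`, following the structure of the existing PPO TL;DR example script. Again a sketch under the same assumptions as above (the paths and dataset are the stand-ins from the previous snippet, and argument names follow the TRL versions that ship `PPOv2Trainer`):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import PPOv2Config, PPOv2Trainer

SFT = "pythia-1b-deduped-sft-tldr"  # paths from the sketch above
RM = "pythia-1b-deduped-rm-tldr"

tokenizer = AutoTokenizer.from_pretrained(SFT)

# PPOv2Trainer consumes a dataset of tokenized prompts ("input_ids").
dataset = load_dataset("trl-lib/tldr", split="train")
dataset = dataset.map(
    lambda x: {"input_ids": tokenizer(x["prompt"]).input_ids},
    remove_columns=dataset.column_names,
)

trainer = PPOv2Trainer(
    config=PPOv2Config(output_dir="pythia-1b-deduped-ppo-tldr"),
    tokenizer=tokenizer,
    policy=AutoModelForCausalLM.from_pretrained(SFT),
    ref_policy=AutoModelForCausalLM.from_pretrained(SFT),  # frozen KL reference
    reward_model=AutoModelForSequenceClassification.from_pretrained(RM, num_labels=1),
    value_model=AutoModelForSequenceClassification.from_pretrained(RM, num_labels=1),
    train_dataset=dataset,
)
trainer.train()
```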
Your contribution
I will try today to reproduce the Pythia 1B deduped checkpoints that the example uses. If that works, I can open a PR demonstrating how I created them. If it doesn't, I'm not sure what else I can do.