huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Example Scripts/Commands for Creating SFT & Reward Models for PPO/RLOO/Other Trainers #2015

Open RylanSchaeffer opened 2 months ago

RylanSchaeffer commented 2 months ago

Feature request

Please provide example scripts in https://github.com/huggingface/trl/tree/main/examples/scripts/ppo showing how to create the corresponding SFT and RM checkpoints used for PPO.

Motivation

TRL helpfully provides an example script for PPOv2Trainer.

However, it does not provide an example script showing how the SFT checkpoint (cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr) and reward model checkpoint (cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr) were created.

I tried creating my own SFT and RM checkpoints using SFTTrainer and RewardTrainer, then ran PPO with PPOv2Trainer (a sketch of the pipeline is below). The SFT and RM training runs looked reasonable (stable, with good test losses and good RM test accuracy), but the PPO runs were unstable and performed poorly. I now cannot tell whether my SFT model is bad, whether my reward model is bad, or whether my PPO hyperparameters are bad.
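For concreteness, here is roughly the pipeline I ran, as a minimal sketch rather than a vetted recipe: the dataset names, column names, hyperparameters, and output paths are placeholders chosen for illustration (they are not what produced the cleanrl checkpoints), and it assumes a TRL version whose trainers still accept `tokenizer=` (newer releases use `processing_class=`).

```python
# Rough sketch of my SFT -> RM pipeline (placeholders throughout, not the official recipe).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RewardConfig, RewardTrainer, SFTConfig, SFTTrainer

base_model = "EleutherAI/pythia-1b-deduped"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Pythia ships without a pad token

# ---- 1. SFT on a TL;DR-style dataset (placeholder, assumed "prompt"/"completion" columns) ----
sft_dataset = load_dataset("trl-lib/tldr", split="train").map(
    lambda ex: {"text": ex["prompt"] + ex["completion"]}
)
sft_trainer = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(base_model),
    args=SFTConfig(output_dir="pythia-1b-sft-tldr", dataset_text_field="text"),
    train_dataset=sft_dataset,
    tokenizer=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("pythia-1b-sft-tldr")

# ---- 2. Reward model on pairwise preferences (placeholder, assumed "prompt"/"chosen"/"rejected") ----
def tokenize_pair(ex):
    chosen = tokenizer(ex["prompt"] + ex["chosen"], truncation=True, max_length=512)
    rejected = tokenizer(ex["prompt"] + ex["rejected"], truncation=True, max_length=512)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

pref_dataset = load_dataset("trl-lib/tldr-preference", split="train")
pref_dataset = pref_dataset.map(tokenize_pair, remove_columns=pref_dataset.column_names)

rm = AutoModelForSequenceClassification.from_pretrained("pythia-1b-sft-tldr", num_labels=1)
rm.config.pad_token_id = tokenizer.pad_token_id
rm_trainer = RewardTrainer(
    model=rm,
    args=RewardConfig(output_dir="pythia-1b-rm-tldr", max_length=512),
    train_dataset=pref_dataset,
    tokenizer=tokenizer,
)
rm_trainer.train()
rm_trainer.save_model("pythia-1b-rm-tldr")

# ---- 3. Pass "pythia-1b-sft-tldr" and "pythia-1b-rm-tldr" to the PPOv2 example ----
# (in place of the cleanrl checkpoints used in examples/scripts/ppo)
```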

It would be amazing if the examples included scripts and/or commands demonstrating how to properly create SFT and RM checkpoints for subsequent use in PPO, RLOO, and other methods.

Edit: I know examples exist for SFT and RM training. I want examples that specifically demonstrate how to create SFT and RM checkpoints for PPO, because my attempts so far have not been successful.

Your contribution

I will try today to reproduce the Pythia 1B deduped checkpoints used in the example. If that works, I can open a PR demonstrating how I created them. If it doesn't, I'm not sure what else I can do.

RylanSchaeffer commented 2 months ago

I did open discussions on the cleanrl models but I haven't heard back: https://huggingface.co/cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr/discussions/1

RylanSchaeffer commented 1 month ago

I just discovered that the default RM has neither a padding token nor a chat template:

https://huggingface.co/cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr/blob/main/tokenizer_config.json

This is inconsistent with the tokenizer config of the corresponding default SFT model:

https://huggingface.co/cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr/blob/main/tokenizer_config.json

which likewise has no chat template. This makes me think that the reward model was trained differently from the equivalent SFT model, and that in the PPOv2Trainer example the SFT model is used with a chat template it was not trained on.
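The mismatch is easy to see by loading the two tokenizers directly; this is just a quick check, with my observations noted in the comments:

```python
# Quick check of the two published tokenizer configs.
from transformers import AutoTokenizer

rm_tok = AutoTokenizer.from_pretrained(
    "cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr"
)
sft_tok = AutoTokenizer.from_pretrained(
    "cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr"
)

# For the reward model, both come back None (no pad token, no chat template).
print(rm_tok.pad_token, rm_tok.chat_template)
# The SFT tokenizer's chat_template is also None.
print(sft_tok.pad_token, sft_tok.chat_template)
```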

I really think we need a demonstration of how to create SFT models and reward models for use with PPOv2Trainer.

cc: @qgallouedec