Open RylanSchaeffer opened 2 months ago
@qgallouedec if you could please comment: once we have consensus on (1) whether this is indeed a problem and (2) what a reasonable solution looks like, I can work on a PR.
@kashif can I ask you to please weigh in on this? I want to know whether you agree with this proposed change, and if so, what a solution might look like. I'd be happy to (help) implement it.
This assumption is made for every trainer that uses a reward model, so it also includes Online DPO and its variants XPO and Nash-MD. This would be a great improvement.
Right, would be a good improvement I believe...
Feature request
Remove the implicit assumption in PPOv2Trainer and RLOOTrainer that the reward model & policy model share the same tokenizer.

Motivation
Currently, as I understand, PPOv2Trainer and RLOOTrainer both assume that the reward model and the policy model share the same tokenizer. This is oftentimes not the case; for instance, I want to try different reward models from RewardBench, and these are often based on different language models with different tokenizers. This implicit assumption should be removed.
For an example of where this behavior pops up, see: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py#L599-L601
Note that the raw tokens of the policy model are passed directly to the reward model. These tokens are not meaningful if the reward model does not share the same tokenizer.
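For concreteness, one possible direction is to decode the policy's token ids back to text and re-tokenize with the reward model's own tokenizer before scoring. Below is a minimal sketch, not the existing TRL internals; the function and argument names (score_with_separate_tokenizer, policy_tokenizer, reward_tokenizer, reward_model) are illustrative assumptions.

```python
import torch


def score_with_separate_tokenizer(
    query_responses: torch.Tensor,  # token ids produced with the policy's tokenizer
    policy_tokenizer,               # tokenizer used to generate query_responses
    reward_tokenizer,               # tokenizer belonging to the reward model
    reward_model,                   # e.g. an AutoModelForSequenceClassification
) -> torch.Tensor:
    # 1. Decode the policy's token ids back into plain text.
    texts = policy_tokenizer.batch_decode(query_responses, skip_special_tokens=True)

    # 2. Re-tokenize the text with the reward model's tokenizer.
    reward_inputs = reward_tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(reward_model.device)

    # 3. Score with the reward model; one scalar logit per sequence.
    with torch.no_grad():
        scores = reward_model(**reward_inputs).logits.squeeze(-1)
    return scores
```

The decode/re-encode round trip adds some overhead and can interact with truncation and special tokens, so whether this belongs inside the trainer or behind an opt-in flag is part of what should be discussed.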
Your contribution
If the developers agree, I'd be happy to discuss this change with them and how to best implement it.