Open RylanSchaeffer opened 2 months ago
@qgallouedec if you could please comment: once we have consensus on (1) whether this is indeed a problem and (2) what a reasonable solution looks like, I can work on a PR.
@kashif can I ask you to please weigh in on this? I want to know whether you agree with this proposed change, and if so, what a solution might look like. I'd be happy to (help) implement it.
This assumption is made for every trainer that uses a reward model, so it also includes Online DPO and its variants XPO and Nash-MD. This would be a great improvement.
Right, would be a good improvement I believe...
Feature request
Remove the implicit assumption in PPOv2Trainer and RLOOTrainer that the reward model & policy model share the same tokenizer.

Motivation
Currently, as I understand, PPOv2Trainer and RLOOTrainer both assume that the reward model and the policy model share the same tokenizer. This is oftentimes not the case; for instance, I want to try different reward models from RewardBench, and these are often based on different language models with different tokenizers. This implicit assumption should be removed.
For an example of where this behavior pops up, see: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py#L599-L601
Note that the raw tokens of the policy model are passed directly to the reward model. These tokens are not meaningful if the reward model does not share the same tokenizer.
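For concreteness, one possible direction is to decode the policy's token ids back to text and re-tokenize with the reward model's own tokenizer before scoring. Below is a minimal sketch, not the existing TRL internals; the function and argument names (score_with_separate_tokenizer, policy_tokenizer, reward_tokenizer, reward_model) are illustrative assumptions.

```python
import torch


def score_with_separate_tokenizer(
    query_responses: torch.Tensor,  # token ids produced with the policy's tokenizer
    policy_tokenizer,               # tokenizer used to generate query_responses
    reward_tokenizer,               # tokenizer belonging to the reward model
    reward_model,                   # e.g. an AutoModelForSequenceClassification
) -> torch.Tensor:
    # 1. Decode the policy's token ids back into plain text.
    texts = policy_tokenizer.batch_decode(query_responses, skip_special_tokens=True)

    # 2. Re-tokenize the text with the reward model's tokenizer.
    reward_inputs = reward_tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(reward_model.device)

    # 3. Score with the reward model; one scalar logit per sequence.
    with torch.no_grad():
        scores = reward_model(**reward_inputs).logits.squeeze(-1)
    return scores
```

The decode/re-encode round trip adds some overhead and can interact with truncation and special tokens, so whether this belongs inside the trainer or behind an opt-in flag is part of what should be discussed.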
Your contribution
If the developers agree, I'd be happy to discuss this change with them and how to best implement it.