Sean-OB opened this issue 1 month ago
I observed the same problem with DPOTrainer: generate_during_eval=True in DPOConfig produces reference outputs from the model currently being trained.
Below is a snippet from ppo_trainer.py (line permalink).

When training with PEFT, ref_model is the same as the base model but is invoked inside a context manager that disables the adapters. However, the code that generates the reference responses does not use this context. As a result, the reference responses logged in the tables come from the optimized RL model rather than the reference model.
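For reference, a minimal sketch of the fix being described, assuming a PEFT PeftModel whose disable_adapter() context manager temporarily reverts to the frozen base weights; generate_reference_responses and its arguments are illustrative stand-ins, not TRL's actual API:

```python
from contextlib import nullcontext

import torch


def generate_reference_responses(model, queries, is_peft_model, ref_model=None, **generation_kwargs):
    # Illustrative helper, not TRL's API. With PEFT, the "reference model"
    # is the very network being trained, so reference generation must run
    # with the adapters disabled; otherwise the outputs come from the
    # current RL policy, which is the bug reported here.
    if is_peft_model:
        # peft.PeftModel.disable_adapter() is a context manager that
        # temporarily deactivates the adapters.
        adapter_ctx = model.disable_adapter()
        generator = model
    else:
        adapter_ctx = nullcontext()
        generator = ref_model

    with torch.no_grad(), adapter_ctx:
        return generator.generate(queries, **generation_kwargs)
```

Without adapter_ctx wrapped around the generate call, the PEFT branch silently samples from the trained policy, which matches the behavior seen in the logged tables.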
To reproduce, run any training loop with the PPOTrainer and your logging software of choice (my setup uses WandB) and look at the table of responses: the reference responses will be drawn from the same distribution as the model responses. Below is a screenshot from a dummy run in which I rewarded the model for outputting the word "but." Since the reference model is frozen, the reference responses should not look any different after the loop.
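For concreteness, a rough reproduction sketch along these lines, assuming the older TRL PPOTrainer API (PPOConfig(model_name=..., log_with="wandb"), generate(..., generate_ref_response=True), and log_stats(..., columns_to_log=...)); argument names may differ across TRL versions:

```python
import torch
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", log_with="wandb", batch_size=4, mini_batch_size=2)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# ref_model=None: with PEFT, the trainer is supposed to fall back to the
# adapter-disabled base model as the reference.
trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

queries = [tokenizer.encode("Tell me something:", return_tensors="pt").squeeze(0)] * config.batch_size
for _ in range(50):
    responses, ref_responses = trainer.generate(
        queries, return_prompt=False, generate_ref_response=True, max_new_tokens=16
    )
    texts = [tokenizer.decode(r) for r in responses]
    # Dummy reward: +1 whenever the response contains the word "but".
    rewards = [torch.tensor(1.0 if "but" in t else -1.0) for t in texts]
    stats = trainer.step(queries, responses, rewards)
    batch = {
        "query": [tokenizer.decode(q) for q in queries],
        "response": texts,
        "ref_response": [tokenizer.decode(r) for r in ref_responses],
    }
    trainer.log_stats(stats, batch, rewards, columns_to_log=["query", "response", "ref_response"])

# Symptom: as training progresses, the logged ref_response column also fills
# with "but", even though the reference responses should stay unchanged.
```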