Oh nice! You're absolutely spot on. We should definitely match this implementation detail. I have just created a PR to fix this issue and will be running experiments to examine its effect.
Hey @liutianlin0121, let's keep the issue open until we have the benchmark results :)
No noticeable difference between the runs before and after the fix. We are good! Closing the issue. Here is the command used to generate the comparison:
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=openrlbenchmark&wpn=lm-human-preferences&xaxis=_step&ceik=task_id&cen=task.value.policy.initial_model&metrics=ppo/objective/score&metrics=ppo/objective/kl&metrics=ppo/objective/entropy&metrics=ppo/objective/score_total&metrics=ppo/objective/kl_coef&metrics=ppo/ppo/loss/total&metrics=ppo/ppo/loss/value&metrics=ppo/ppo/loss/policy&metrics=ppo/ppo/policy/clipfrac&metrics=ppo/ppo/policy/entropy&metrics=ppo/ppo/returns/mean&metrics=ppo/ppo/policy/approxkl&metrics=ppo/ppo/val/clipfrac&metrics=ppo/ppo/val/error&metrics=ppo/ppo/val/mean&metrics=ppo/ppo/returns/var&metrics=ppo/ppo/val/vpred' \
'124M' \
--filters '?we=openrlbenchmark&wpn=lm_human_preference_details&xaxis=_step&ceik=rewards.value.label_dataset&cen=exp_name&metrics=objective/scores&metrics=objective/kl&metrics=objective/entropy&metrics=objective/score_total&metrics=objective/kl_coef&metrics=ppo/loss/total&metrics=ppo/loss/value&metrics=ppo/loss/policy_avg&metrics=ppo/policy/clipfrac_avg&metrics=ppo/policy/entropy_avg&metrics=ppo/returns/mean&metrics=ppo/policy/approxkl_avg&metrics=ppo/val/clipfrac_avg&metrics=ppo/val/error&metrics=ppo/val/mean&metrics=ppo/returns/var&metrics=ppo/val/vpred' \
'train_policy_accelerate?tag=v0.1.0-68-g2f3aa38&tag=tf_adam&tag=gpt2&cl=tf_adam,gpt2' \
'train_policy_accelerate?tag=v0.1.0-58-g4f42012&tag=tf_adam&tag=gpt2&cl=tf_adam,gpt2 (before PR-10)' \
--env-ids sentiment descriptiveness \
--env-ids sentiment/offline_5k.json descriptiveness/offline_5k.json \
--no-check-empty-runs \
--pc.ncols 6 \
--pc.ncols-legend 1 \
--output-filename static/0compare \
--scan-history
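For a quick numerical sanity check in addition to the plots, below is a minimal sketch (not part of openrlbenchmark) that pulls the two tag groups from W&B and compares the last logged `objective/scores`. The entity, project, tags, and metric name are taken from the command above; `final_scores` and the last-value aggregation are illustrative assumptions, not what `rlops_multi_metrics` does internally.

```python
# Minimal sketch: compare the final objective/scores of the runs tagged
# v0.1.0-68-g2f3aa38 (after the fix) vs. v0.1.0-58-g4f42012 (before PR-10).
# Assumes the wandb client is installed and the project is readable.
import wandb

api = wandb.Api()
PROJECT = "openrlbenchmark/lm_human_preference_details"


def final_scores(commit_tag: str) -> list[float]:
    """Last logged objective/scores for every run carrying `commit_tag` (hypothetical helper)."""
    runs = api.runs(PROJECT, filters={"tags": {"$in": [commit_tag]}})
    values = []
    for run in runs:
        history = run.history(keys=["objective/scores"], pandas=True)
        if len(history):
            values.append(float(history["objective/scores"].iloc[-1]))
    return values


after_pr = final_scores("v0.1.0-68-g2f3aa38")   # runs after the fix (per the labels above)
before_pr = final_scores("v0.1.0-58-g4f42012")  # runs before PR-10
if after_pr and before_pr:
    print("after the fix :", sum(after_pr) / len(after_pr))
    print("before PR-10  :", sum(before_pr) / len(before_pr))
```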
Hi Costa!

A quick question about `normalize_after` for reward normalization: the current implementation seems to normalize the gain and bias of the reward model using the reward model's backbone (the logit-returning language model). Specifically, for both `normalize_before` and `normalize_after`, `accelerator.unwrap_model(reward_model).pretrained_model` is used to generate responses.

However, according to OAI's paper and implementation, it seems they normalize the reward model based on responses generated from the pretrained model. For `normalize_before`, the pretrained model is the same as the reward model's backbone. But for `normalize_after`, differences might arise because `reward_model.pretrained_model` could have been updated during reward learning. Using the notation of the paper, the responses for normalization come from the fixed pretrained language model $\rho$; see the text after Equation (1). In their code, they use `ref_policy` (link) for both `normalize_before` and `normalize_after`, and it seems `ref_policy` is not updated during reward learning.

Thought this detail might interest you! Nevertheless, with a low learning rate and just one epoch of reward learning, the practical difference may be small, as the parameters of the reward model's backbone may not deviate significantly from their initialization. A toy sketch of the distinction follows below.
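To make the distinction concrete, here is a self-contained toy sketch of the normalization step. None of the class or function names below come from lm_human_preference_details or lm-human-preferences; they are placeholders, and the gain/bias recipe (fit an affine transform to a target mean and standard deviation) is an assumption about the normalization, not a copy of either codebase. The only point it illustrates is which model generates the responses that the gain and bias are fitted on.

```python
# Toy sketch (PyTorch), not the actual lm_human_preference_details code.
# The reward is reward_gain * raw_reward + reward_bias, and "normalization"
# refits gain/bias on responses produced by some response_source. The choice
# of response_source is exactly the normalize_after question above.
import torch
import torch.nn as nn


class ToyRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.pretrained_model = backbone      # plays the role of the reward model's backbone
        self.scalar_head = nn.Linear(16, 1)
        self.reward_gain = nn.Parameter(torch.tensor(1.0))
        self.reward_bias = nn.Parameter(torch.tensor(0.0))

    def raw_reward(self, responses: torch.Tensor) -> torch.Tensor:
        return self.scalar_head(self.pretrained_model(responses)).squeeze(-1)

    def forward(self, responses: torch.Tensor) -> torch.Tensor:
        return self.reward_gain * self.raw_reward(responses) + self.reward_bias


@torch.no_grad()
def normalize(reward_model: ToyRewardModel, response_source: nn.Module,
              queries: torch.Tensor, target_mean: float = 0.0, target_std: float = 1.0):
    # "Responses" are just feature vectors here; in the real setting they would
    # be sampled continuations of the queries.
    responses = response_source(queries)
    raw = reward_model.raw_reward(responses)
    gain = target_std / raw.std()
    reward_model.reward_gain.copy_(gain)
    reward_model.reward_bias.copy_(target_mean - gain * raw.mean())


backbone = nn.Linear(16, 16)
ref_policy = nn.Linear(16, 16)                      # stands in for the frozen rho
ref_policy.load_state_dict(backbone.state_dict())   # identical at initialization
reward_model = ToyRewardModel(backbone)
queries = torch.randn(256, 16)

# normalize_before: backbone == rho, so both response sources give the same gain/bias.
# Pretend reward learning has drifted the backbone:
with torch.no_grad():
    backbone.weight.add_(0.1 * torch.randn_like(backbone.weight))

# normalize_after: the two choices can now disagree.
normalize(reward_model, ref_policy, queries)                       # OAI-style: frozen rho
print("rho-based      gain/bias:", reward_model.reward_gain.item(), reward_model.reward_bias.item())
normalize(reward_model, reward_model.pretrained_model, queries)    # current: drifted backbone
print("backbone-based gain/bias:", reward_model.reward_gain.item(), reward_model.reward_bias.item())
```

With `ref_policy` the normalization statistics stay tied to the frozen $\rho$; with the backbone they track whatever the reward model's language model has become during reward learning, which is the (likely small) difference discussed above.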