System Info
Not applicable: this issue concerns the code implementation, not a bug that occurs at runtime.
Who can help?
@ziyuwan
Information
[X] The official example scripts
[X] My own modified scripts
Tasks
[X] An officially supported task in the codebase (such as scripts/, ...)
[X] My own task or dataset (give details below)
Reproduction
Not applicable: this issue concerns the code implementation, not a bug that occurs at runtime.
Expected behavior
RMRemoteCaller currently relies on an ad hoc postprocessing function to convert the policy output format into the reward model input format, and that conversion has a bug. For an output answer in the format
<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|><|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}
it is not enough to replace the policy's step_tag with the PRM's step_tag. The text also needs to be converted into whatever input format the reward model expects, for example the format used in our PRM training code here:
{question} {answer}
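A minimal sketch of the conversion described above, assuming the question and the answer are available separately. The function name `to_prm_input` and both step tag values are hypothetical and are not taken from the codebase.

```python
def to_prm_input(question: str, answer: str,
                 policy_step_tag: str, prm_step_tag: str) -> str:
    """Build a PRM input in the '{question} {answer}' format.

    Hypothetical helper: it swaps the step tag inside the answer only,
    then drops the policy's chat template entirely instead of leaving
    it in the string that is sent to the reward model.
    """
    converted_answer = answer.replace(policy_step_tag, prm_step_tag)
    return f"{question} {converted_answer}"


# Illustrative tag values only; the real tags depend on the models used.
example = to_prm_input(
    question="What is 1 + 1?",
    answer="Add the numbers.\n\nThe answer is 2.\n\n",
    policy_step_tag="\n\n",
    prm_step_tag=" [STEP]\n",
)
```

The point of the sketch is that the reward model never sees the `<|im_start|>` chat markup: only the question and the re-tagged answer are passed on.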
After discussion with @morning9393 and @YanSong97, we agree that the policy format string and the PRM format string should be decoupled, so that more sophisticated input and prompting methods can be supported.
Therefore, we will first quickly push the updated code and results, and then redesign the corresponding code.
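One possible shape for the decoupled design is sketched below. This is an assumption about how the redesign could look, not the planned implementation: `PromptFormat`, `policy_to_prm`, and the step tag values are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptFormat:
    """A format string plus the step tag that goes with it (hypothetical)."""
    template: str
    step_tag: str

    def render(self, question: str, answer: str = "") -> str:
        return self.template.format(question=question, answer=answer)


# The policy and the PRM each own their format; neither knows the other's.
POLICY_FORMAT = PromptFormat(
    template=(
        "<|im_start|>system\nPlease reason step by step, and put your final "
        "answer within \\boxed{{}}.<|im_end|><|im_start|>user\n{question}"
        "<|im_end|>\n<|im_start|>assistant\n{answer}"
    ),
    step_tag="\n\n",  # assumed value
)
PRM_FORMAT = PromptFormat(
    template="{question} {answer}",
    step_tag=" [STEP]\n",  # assumed value
)


def policy_to_prm(question: str, policy_answer: str,
                  policy_fmt: PromptFormat, prm_fmt: PromptFormat) -> str:
    """Split the answer on the policy tag, re-join with the PRM tag."""
    steps = [s for s in policy_answer.split(policy_fmt.step_tag) if s.strip()]
    prm_answer = "".join(step + prm_fmt.step_tag for step in steps)
    return prm_fmt.render(question, prm_answer)


policy_prompt = POLICY_FORMAT.render("What is 1 + 1?")
prm_input = policy_to_prm("What is 1 + 1?", "Add them.\n\nThe answer is 2.",
                          POLICY_FORMAT, PRM_FORMAT)
```

With this split, supporting a new prompting method would mean adding another `PromptFormat` instance rather than editing a shared postprocessing function.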