Small bug in string post-processing in RMRemoteCaller

System Info

Not applicable: this is not a runtime bug but an issue with the code implementation.

Who can help?

@ziyuwan

Information

[X] The official example scripts
[X] My own modified scripts

Tasks

[X] An officially supported task in the codebase (such as scripts/, ...)
[X] My own task or dataset (give details below)

Reproduction

Not applicable: this is not a runtime bug but an issue with the code implementation.

Expected behavior

The current implementation of RMRemoteCaller uses a hacky post-processing function to convert the policy output format into the reward model input format. This implementation has a bug. Consider a policy output in the following format:

<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|><|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}

Replacing the policy's step_tag with the PRM's step_tag is not enough. The string also needs to be converted into whatever input format the reward model expects, such as the one used in our PRM training code:

{question} {answer}

After discussion with @morning9393 and @YanSong97, we all agree that we need to decouple the policy format string from the PRM format string so as to support more sophisticated input and prompting methods.

Therefore, we will first push a quick fix with updated code and results, and then redesign the corresponding code.
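For illustration, here is a minimal sketch of what a decoupled conversion could look like. It is not the actual openr code: the function name `policy_to_prm_input`, the regular expression, the `PRM_TEMPLATE` constant, and the step tag values in the usage example are all assumptions made for this example. It extracts the question and answer from the policy chat format shown above, swaps the step tags, and re-renders the result with a separate PRM template.

```python
import re

# Hypothetical pattern for the policy chat format quoted in this issue.
# It captures the user question and the assistant answer.
POLICY_FORMAT_RE = re.compile(
    r"<\|im_start\|>user\n(?P<question>.*?)<\|im_end\|>\n"
    r"<\|im_start\|>assistant\n(?P<answer>.*)",
    re.DOTALL,
)

# Hypothetical PRM input template, kept separate from the policy format.
PRM_TEMPLATE = "{question} {answer}"


def policy_to_prm_input(text, policy_step_tag, prm_step_tag):
    """Convert a policy-formatted string into the PRM input format."""
    match = POLICY_FORMAT_RE.search(text)
    if match is None:
        raise ValueError("text does not follow the expected policy format")
    question = match.group("question")
    # Swap the step tags only inside the answer, not in the prompt.
    answer = match.group("answer").replace(policy_step_tag, prm_step_tag)
    return PRM_TEMPLATE.format(question=question, answer=answer)


if __name__ == "__main__":
    # Illustrative step tags only; the real tags may differ.
    sample = (
        "<|im_start|>system\nPlease reason step by step.<|im_end|>"
        "<|im_start|>user\nWhat is 1+1?<|im_end|>\n"
        "<|im_start|>assistant\nAdd the numbers.\n\nThe answer is 2.\n\n"
    )
    print(repr(policy_to_prm_input(sample, "\n\n", " [STEP]\n")))
```

Keeping the policy pattern and the PRM template as two independent objects means either side can change (for example, a new chat template or a different reward model input) without touching the other.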
