openreasoner / openr

OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models
https://openreasoner.github.io/
MIT License

Small bugs about string post-processing in RMRemoteCaller #15

Closed ziyuwan closed 1 month ago

ziyuwan commented 1 month ago

System Info

Not a runtime bug; this concerns the code implementation.

Who can help?

@ziyuwan

Information

Tasks

Reproduction

Not a runtime bug; this concerns the code implementation.

Expected behavior

The current implementation of RMRemoteCaller uses a hacky post-processing function to convert the policy output format into the reward model input format, and that function has a bug. Consider a policy output with the format

<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|><|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}

Replacing the policy's step_tag with the PRM's step_tag is not sufficient. The output also needs to be converted into whatever format the reward model expects, for example the format used in our PRM training code:

{question} {answer}
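
As a rough illustration of the conversion described above, the sketch below extracts the question and answer from a chat-templated policy output and re-renders them as `{question} {answer}` with the step tags swapped. It is not the code in RMRemoteCaller: the function name `policy_output_to_prm_input`, the regular expression, and both step-tag values are hypothetical placeholders, and the real tags depend on the policy and PRM in use.

```python
import re

# Hypothetical placeholder tags; substitute the tags of the actual policy and PRM.
POLICY_STEP_TAG = "\n\n"
PRM_STEP_TAG = " <prm_step>\n"

# Matches the user and assistant turns of the chat template quoted in this issue.
_CHAT_RE = re.compile(
    r"<\|im_start\|>user\n(?P<question>.*?)<\|im_end\|>\s*"
    r"<\|im_start\|>assistant\n(?P<answer>.*)",
    re.DOTALL,
)


def policy_output_to_prm_input(
    text: str,
    policy_step_tag: str = POLICY_STEP_TAG,
    prm_step_tag: str = PRM_STEP_TAG,
) -> str:
    """Convert a chat-templated policy output into '{question} {answer}'."""
    match = _CHAT_RE.search(text)
    if match is None:
        raise ValueError("text does not follow the expected chat template")
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    # Split on the policy's step tag, then terminate every step with the PRM's tag.
    steps = [s.strip() for s in answer.split(policy_step_tag) if s.strip()]
    prm_answer = "".join(step + prm_step_tag for step in steps)
    return f"{question} {prm_answer}"


example = (
    "<|im_start|>system\nPlease reason step by step.<|im_end|>"
    "<|im_start|>user\nWhat is 1+1?<|im_end|>\n"
    "<|im_start|>assistant\nStep 1: add.\n\nStep 2: The answer is 2."
)
converted = policy_output_to_prm_input(example)
```

The point of the sketch is that the chat template is stripped entirely before the PRM sees the text, rather than only replacing one step tag with another.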

After discussion with @morning9393 and @YanSong97, we agree that the policy format string and the PRM format string need to be decoupled, so that we can support more sophisticated input and prompting methods.
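
One possible shape for such a decoupling is sketched below, purely as an assumption and not as the planned redesign: the policy side parses its own output into a format-neutral record, and the PRM side renders that record with its own template and step tag. The class names `ReasoningSample`, `PolicyFormat`, and `PRMFormat`, and the tag values, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReasoningSample:
    """Format-neutral record shared by the policy side and the PRM side."""
    question: str
    steps: Tuple[str, ...]


@dataclass(frozen=True)
class PolicyFormat:
    """Knows only how the policy separates its reasoning steps."""
    step_tag: str

    def parse(self, question: str, answer: str) -> ReasoningSample:
        steps = tuple(s.strip() for s in answer.split(self.step_tag) if s.strip())
        return ReasoningSample(question=question, steps=steps)


@dataclass(frozen=True)
class PRMFormat:
    """Knows only how the reward model expects its input to look."""
    template: str  # must contain {question} and {answer}
    step_tag: str

    def render(self, sample: ReasoningSample) -> str:
        answer = "".join(step + self.step_tag for step in sample.steps)
        return self.template.format(question=sample.question, answer=answer)


# Placeholder tag values for illustration only.
policy_format = PolicyFormat(step_tag="\n\n")
prm_format = PRMFormat(template="{question} {answer}", step_tag=" <prm_step>\n")

sample = policy_format.parse("What is 1+1?", "Add the numbers.\n\nThe answer is 2.")
prm_input = prm_format.render(sample)
```

With this split, changing the policy prompt or the PRM template only touches one of the two format objects, since neither needs to know the other's string layout.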

We will therefore first push a quick fix with updated code and results, and then redesign the corresponding code.