openpsi-project / ReaLHF

Super-Efficient RLHF Training of LLMs with Parameter Reallocation
Apache License 2.0

GRPO has no PRM #79

Open yiyepiaoling0715 opened 2 weeks ago

yiyepiaoling0715 commented 2 weeks ago

GRPO supports step-level rewards, also known as a process reward model (PRM), but I don't see this in the code. Can you explain why, or how to use step-level rewards? Thanks.

garrett4wade commented 1 week ago

Sorry for the late reply.

PRMs and ORMs are similar. The current code simply extracts the score at the end of each sentence. You can modify the model interface to use scores at all positions (or at the step level, such as the scores output at every "comma" token), just like how we use values in PPO.
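
To make the difference concrete, here is a minimal sketch in plain Python. It is not ReaLHF's model interface: the function names, the `COMMA` token id, and the example scores are all hypothetical. It assumes the reward model emits one scalar score per response token, and it treats the final token as closing the last step.

```python
from typing import List


def outcome_score(scores: List[float]) -> float:
    """ORM-style: keep only the score at the final token of the response."""
    return scores[-1]


def step_scores(
    token_ids: List[int], scores: List[float], step_token_id: int
) -> List[float]:
    """PRM-style: dense per-token rewards, non-zero only at step boundaries.

    A step boundary is any token equal to `step_token_id`; the final token
    is also treated as a boundary so the last step receives a reward.
    """
    if len(token_ids) != len(scores):
        raise ValueError("token_ids and scores must have the same length")
    last = len(token_ids) - 1
    return [
        s if (t == step_token_id or i == last) else 0.0
        for i, (t, s) in enumerate(zip(token_ids, scores))
    ]


COMMA = 11  # hypothetical token id for "," (depends on the tokenizer)
token_ids = [5, 8, COMMA, 9, 4, COMMA, 7, 2]
scores = [0.1, 0.2, 0.9, 0.3, 0.1, -0.4, 0.2, 0.7]

print(outcome_score(scores))                  # 0.7
print(step_scores(token_ids, scores, COMMA))  # [0.0, 0.0, 0.9, 0.0, 0.0, -0.4, 0.0, 0.7]
```

The dense vector returned by `step_scores` has the same shape as the per-token values used in PPO, so it could be fed into a per-token return or advantage computation instead of a single scalar per response.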

This example may also be helpful.

We're happy to help if you run into any issues during implementation.