Dada-Cloudzxy opened 5 days ago
I'm also interested in this! From your OpenR paper, I guess you assign the label `+` when `mc_value` is larger than 0 (if I understand correctly), meaning that the path can still lead to a correct answer. But I don't think that's a good idea, and other work [1] uses regression to predict the reward instead.
OmegaPRM and Math-Shepherd both seem to report that soft labels work better?
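To make the difference concrete, here is a minimal sketch of the two labeling schemes as I understand them. It is not OpenR's actual code: the function names are hypothetical, and I am assuming `mc_value` is the fraction of Monte Carlo rollouts from a step that reach a correct answer, so it lies in [0, 1].

```python
import math


def hard_label(mc_value: float) -> int:
    """Hard label (hypothetical helper): 1 ("+") if any rollout was correct."""
    return 1 if mc_value > 0 else 0


def soft_label(mc_value: float) -> float:
    """Soft label (hypothetical helper): keep the Monte Carlo estimate as the target."""
    return mc_value


def bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy; also valid for soft targets in [0, 1]."""
    pred = min(max(pred, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))


# Assumed example values: fraction of correct rollouts per step.
mc_values = [0.0, 0.125, 0.5, 1.0]

hard = [hard_label(v) for v in mc_values]
soft = [soft_label(v) for v in mc_values]

for v, h, s in zip(mc_values, hard, soft):
    print(f"mc_value={v:.3f}  hard={h}  soft={s:.3f}")
```

With hard labels, a step where only 1 of 8 rollouts succeeds gets the same target as a step where every rollout succeeds. With soft labels the target keeps that difference, which is my concern with thresholding at 0.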