Closed sdsfas12 closed 2 months ago
Hi! Thanks for your interest in our work!
We are uploading the reward model checkpoints to https://huggingface.co/collections/ehzoah/efficient-exact-optimization-667995e5a7f87dff7d01a85a Running the provided training script should also produce a reward model with similar performance.
The final win rate is calculated by averaging the win rate of each example. The core logic is something like this:
```python
# example:
# policy_scores = [[1.0, 0.8]]
# ref_scores = [[0.9, 0.6]]
wins = []
for ps, rs in zip(policy_scores, ref_scores):
    samp_wins = []
    for p in ps:
        for r in rs:
            samp_wins.append(p > r)
    wins.append(sum(samp_wins) / len(samp_wins))
win_perc = sum(wins) / len(wins)
```
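For reference, the logic above can be wrapped into a small self-contained function (the name `pairwise_win_rate` is mine, not from the repo). On the example scores, 3 of the 4 pairwise comparisons are wins, so the result is 0.75:

```python
def pairwise_win_rate(policy_scores, ref_scores):
    """Average per-example win rate over all (policy, ref) sample pairs.

    policy_scores / ref_scores: one inner list per prompt,
    one reward-model score per sampled response.
    """
    wins = []
    for ps, rs in zip(policy_scores, ref_scores):
        # fraction of pairwise comparisons the policy samples win
        samp_wins = [p > r for p in ps for r in rs]
        wins.append(sum(samp_wins) / len(samp_wins))
    # average the per-example win rates over all examples
    return sum(wins) / len(wins)

# Example from above: 1.0 > 0.9, 1.0 > 0.6, 0.8 > 0.6 win; 0.8 > 0.9 loses.
print(pairwise_win_rate([[1.0, 0.8]], [[0.9, 0.6]]))  # -> 0.75
```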
Okay, thanks for your kind answer! I'll try your checkpoints and win rate calculation code.
Hi there! Your work is exciting and inspiring. Thanks a lot!
I'm currently trying to reproduce the experiments, but the reproduced reward model win rates are much lower than those in the paper. I wonder if I'm using the wrong reward model for inference, or calculating win rates incorrectly. Could you share more information about the reward model used for inference and the specific process for calculating win rates (e.g., counting wins/losses for each comparison, or aggregating the wins/losses of all comparisons for each sample and then counting wins/losses over samples)?