Oh, thank you for asking! This is actually a good point. I was doing reward model centering after training the RMs; you can see this here: https://github.com/tlc4418/llm_optimization/blob/main/src/reward_modeling/training/trainer_rm.py#L334-L345.
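In rough terms, that step just measures the reward distribution on a set of training examples and stores the statistics in the model config. A minimal sketch of the idea (the function and variable names here are hypothetical; see the linked lines for the real implementation):

```python
import torch


def store_centering_stats(model, rewards: torch.Tensor) -> None:
    """Persist dataset-level reward statistics on the model config.

    `rewards` is assumed to be a 1-D tensor of raw reward scores collected
    after training. Writing the stats to the config means they get saved
    alongside the weights and can be read back at load time.
    """
    model.config.mean = rewards.mean().item()
    model.config.std = rewards.std().item()
```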
Note that this requires a small edit to the Open-Assistant codebase, which I had made locally. You need to add mean/std fields to the reward model class:
```python
# In the reward model's __init__: read the centering stats from the model
# config, defaulting to an identity transform if they were never written.
config_dict = config.to_dict()
self.mean = config_dict.get("mean", 0)
self.std = config_dict.get("std", 1)
```
and then use these values to recenter the rewards during the forward pass, for example by adding the following line to the forward() method just before it returns:
```python
# Standardize the raw reward using the stored statistics.
logits = (logits - self.mean) / self.std
```
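Putting the two edits together, here is a minimal sketch of how they fit (the class, config, and backbone here are placeholders, not the actual Open-Assistant classes):

```python
import torch.nn as nn


class CenteredRewardModel(nn.Module):
    """Toy stand-in for a reward model class with post-hoc centering."""

    def __init__(self, config, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # base model mapping tokens to a scalar score
        config_dict = config.to_dict()
        # Default to an identity transform if the centering stats were
        # never written to the config.
        self.mean = config_dict.get("mean", 0)
        self.std = config_dict.get("std", 1)

    def forward(self, input_ids, attention_mask=None):
        logits = self.backbone(input_ids, attention_mask=attention_mask)
        # Recenter/rescale the raw reward with the stored statistics.
        return (logits - self.mean) / self.std
```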
Awesome - thank you for clarifying!
One detail I'd like to confirm: for the reward model ensembles, was any centering/scaling/standardizing (or any other transformation) applied to the individual models' rewards?
From https://github.com/tlc4418/llm_optimization/blob/main/src/bon/ensemble_rm.py#L23-L55, it looks like the answer is no, but I wanted to check :)