tlc4418 / llm_optimization

A repo for RLHF training and BoN over LLMs, with support for reward model ensembles.
https://arxiv.org/abs/2310.02743
MIT License

Clarification: No Centering / Scaling / Standardizing of Ensembles' Rewards? #12

Closed by RylanSchaeffer 1 month ago

RylanSchaeffer commented 1 month ago

I'd like to confirm one detail: for the reward model ensembles, was any centering, scaling, standardizing, or other transformation applied to the individual models' rewards?

Looking at https://github.com/tlc4418/llm_optimization/blob/main/src/bon/ensemble_rm.py#L23-L55, it seems the answer is no, but I wanted to check :)

tlc4418 commented 1 month ago

Oh, thank you for asking; this is actually a good point. I was centering the reward models' outputs after training the RMs. You can see this here: https://github.com/tlc4418/llm_optimization/blob/main/src/reward_modeling/training/trainer_rm.py#L334-L345.
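
For a rough picture of what that centering step does, here is a minimal sketch, assuming the reward statistics are gathered on a held-out calibration set of prompt/response pairs and then written into the model config. The names compute_centering_stats and calibration_loader, and the .logits access, are illustrative assumptions, not code from trainer_rm.py:

import torch


@torch.no_grad()
def compute_centering_stats(reward_model, calibration_loader, device="cuda"):
    """Collect raw rewards on a calibration set and return their mean/std.

    Both arguments are hypothetical placeholders; in this repo the statistics
    are computed inside trainer_rm.py after reward model training.
    """
    reward_model.eval()
    rewards = []
    for batch in calibration_loader:
        out = reward_model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
        )
        # Assumes the model returns one scalar reward logit per sequence.
        rewards.append(out.logits.squeeze(-1).float().cpu())
    rewards = torch.cat(rewards)
    return rewards.mean().item(), rewards.std().item()


# Example usage (illustrative): persist the statistics in the model config so
# they are saved with the checkpoint and can be applied at inference time.
#   mean, std = compute_centering_stats(reward_model, calibration_loader)
#   reward_model.config.mean = mean
#   reward_model.config.std = std
#   reward_model.save_pretrained("rm_centered")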

But this actually requires a small edit to the Open-Assistant codebase, which I had made locally. You need to add these mean/std fields to the reward model class:

# Read the centering statistics stored in the model config; default to the
# identity transform (mean 0, std 1) if they are not present.
config_dict = config.to_dict()
self.mean = config_dict.get("mean", 0)
self.std = config_dict.get("std", 1)

and then use these values to recenter the rewards during the forward pass, for example by adding the following line to the forward() method before returning:

# Normalize the raw reward with the stored statistics before returning it.
logits = (logits - self.mean) / self.std
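
To show how the two fragments above fit together, here is a minimal, self-contained sketch of a reward model head with the centering applied in forward(). It assumes a standard Hugging Face backbone with a scalar value head; it is an illustration of the edit, not the actual Open-Assistant class:

from torch import nn
from transformers import AutoConfig, AutoModel


class CenteredRewardModel(nn.Module):
    """Illustrative stand-in for the Open-Assistant reward model class."""

    def __init__(self, model_name: str):
        super().__init__()
        config = AutoConfig.from_pretrained(model_name)
        self.backbone = AutoModel.from_pretrained(model_name)
        self.value_head = nn.Linear(config.hidden_size, 1)

        # The fields from the snippet above: read the centering statistics that
        # were saved into the config after training; default to the identity
        # transform (mean 0, std 1) when they are absent.
        config_dict = config.to_dict()
        self.mean = config_dict.get("mean", 0)
        self.std = config_dict.get("std", 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Score each sequence from its final hidden state (one common RM choice;
        # the real class may pick the last non-padding token instead).
        logits = self.value_head(hidden[:, -1, :]).squeeze(-1)

        # The one-line edit above: recenter/rescale the raw reward before returning.
        logits = (logits - self.mean) / self.std
        return logits

Because the mean/std live in the checkpoint's config, the centering travels with the model, which is consistent with ensemble_rm.py applying no further transformation to the rewards it receives.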

RylanSchaeffer commented 1 month ago

Awesome - thank you for clarifying!