mrahtz / learning-from-human-preferences

Reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences"

Doubt on normalizing rewards #5

Closed: SestoAle closed this issue 4 years ago

SestoAle commented 4 years ago

Hi,

I have a question about how often you (or the paper) normalize the reward:

do you save all the rewards output by the model in the r_norm object, both during training and test? Or do you occasionally empty r_norm?

Thank you!

mrahtz commented 4 years ago

Hi there,

Yeah, at the moment all reward values generated by the reward predictor are used to update the reward normalisation. r_norm is never emptied; it reports the mean and standard deviation of all values it's ever seen.
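In case it helps, here's a rough sketch of that idea. The class name and the Welford-style update are just illustrative, not the exact code in this repo:

```python
import numpy as np


class RunningNorm:
    """Running mean/std over every reward value seen so far; never reset."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # Sum of squared deviations from the mean (Welford's algorithm)
        self.eps = eps

    def update(self, x):
        # Fold one reward value into the running statistics.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Normalise using the statistics of *all* values ever seen.
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return (x - self.mean) / std


# Every reward the predictor outputs goes through the same normaliser,
# whether it's being used to train the policy or just logged.
r_norm = RunningNorm()
raw_reward = 0.37  # example output from the reward predictor
r_norm.update(raw_reward)
normalized_reward = r_norm.normalize(raw_reward)
```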

Admittedly, this could lead to a situation where, during training, the reward predictor changes in such a way that it starts outputting values with a different mean and scale, which wouldn't be well accounted for by normalisation based on all previous values. It didn't seem to be a problem in the limited scope of this reproduction, but I think I did see it happen in a different project which also used this code. If you're worried about it, I would log the mean and standard deviation of, say, the past 10,000 reward values spit out by the reward predictor, and compare that to the mean and standard deviation reported by r_norm. If you see a big difference, try switching to normalisation based on only a window of recent values (see e.g. r_norm_limited in https://github.com/HumanCompatibleAI/interactive-behaviour-design/blob/master/drlhp/reward_predictor.py); there's a rough sketch of that variant below.
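Something along these lines, again just a sketch with an illustrative class name and window size, not the actual r_norm_limited implementation:

```python
from collections import deque

import numpy as np


class WindowedNorm:
    """Mean/std over only the most recent rewards, so the statistics can
    track a reward predictor whose output distribution drifts over training."""

    def __init__(self, window_size=10_000, eps=1e-8):
        self.values = deque(maxlen=window_size)  # Old values fall off automatically
        self.eps = eps

    def update(self, x):
        self.values.append(x)

    def normalize(self, x):
        if not self.values:
            return x
        vals = np.asarray(self.values, dtype=np.float64)
        return (x - vals.mean()) / (vals.std() + self.eps)
```

Logging the windowed mean/std alongside r_norm's all-time mean/std is a cheap way to see whether the all-time statistics are lagging behind the predictor's current outputs; if the two diverge a lot, the windowed version is probably the safer choice.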

(Also, there isn't any distinction between training and testing in this implementation - there's only a single environment, and we measure performance while training in that environment.)

Hope this helps. I'll close this for now, but if you have any other questions feel free to reply.