mrahtz / learning-from-human-preferences

Reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences"

Doubt on normalizing rewards #5

Closed: SestoAle closed this issue 4 years ago

SestoAle commented 4 years ago

Hi,

I have a question about how often you (or the paper) normalize the reward:

do you save all the rewards output by the model in the r_norm object, both during training and test? Or do you occasionally empty r_norm?

Thank you!

mrahtz commented 4 years ago

Hi there,

Yeah, at the moment all reward values generated by the reward predictor are used to update the reward normalisation. r_norm is never emptied; it reports the mean and standard deviation of all values it's ever seen.
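In case it helps, here's a rough sketch of that idea. The class name and the Welford-style update are just illustrative, not the exact code in this repo:

```python
import numpy as np


class RunningNorm:
    """Running mean/std over every reward value seen so far; never reset."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # Sum of squared deviations from the mean (Welford's algorithm)
        self.eps = eps

    def update(self, x):
        # Fold one reward value into the running statistics.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Normalise using the statistics of *all* values ever seen.
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return (x - self.mean) / std


# Every reward the predictor outputs goes through the same normaliser,
# whether it's being used to train the policy or just logged.
r_norm = RunningNorm()
raw_reward = 0.37  # example output from the reward predictor
r_norm.update(raw_reward)
normalized_reward = r_norm.normalize(raw_reward)
```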

Admittedly, this could lead to a situation where, during training, the reward predictor changes in such a way that it starts outputting values with a different mean and scale, which wouldn't be well accounted for by normalisation based on all previous values. It didn't seem to be a problem in the limited scope of this reproduction, but I think I did see it happen in a different project which also used this code. If you're worried about it, I would log the mean and standard deviation of, say, the past 10,000 reward values spit out by the reward predictor, and compare that to the mean and standard deviation reported by r_norm. If you see a big difference, try switching to normalisation based on only a window of recent values (see e.g. r_norm_limited in https://github.com/HumanCompatibleAI/interactive-behaviour-design/blob/master/drlhp/reward_predictor.py); there's a rough sketch of that variant below.
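Something along these lines, again just a sketch with an illustrative class name and window size, not the actual r_norm_limited implementation:

```python
from collections import deque

import numpy as np


class WindowedNorm:
    """Mean/std over only the most recent rewards, so the statistics can
    track a reward predictor whose output distribution drifts over training."""

    def __init__(self, window_size=10_000, eps=1e-8):
        self.values = deque(maxlen=window_size)  # Old values fall off automatically
        self.eps = eps

    def update(self, x):
        self.values.append(x)

    def normalize(self, x):
        if not self.values:
            return x
        vals = np.asarray(self.values, dtype=np.float64)
        return (x - vals.mean()) / (vals.std() + self.eps)
```

Logging the windowed mean/std alongside r_norm's all-time mean/std is a cheap way to see whether the all-time statistics are lagging behind the predictor's current outputs; if the two diverge a lot, the windowed version is probably the safer choice.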

(Also, there isn't any distinction between training and testing in this implementation - there's only a single environment, and we measure performance while training in that environment.)

Hope this helps. I'll close this for now, but if you have any other questions feel free to reply.