pender closed this issue 6 years ago
@pender Good question. The idea behind these scalers is that they change very slowly from batch to batch; by 1M episodes, they hardly budge at all from episode to episode. If you learn that a certain observation + action vector leads to poor rewards, you don't want batch norm to scale that vector to a different spot in future episodes depending on what else happens to be in the batch.
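A running per-feature scaler of this kind can be sketched as below. This is a minimal illustration of the idea (cumulative statistics that barely move after many samples), not the repository's actual `Scaler` class; the name `RunningScaler` is made up:

```python
import numpy as np

class RunningScaler:
    """Per-feature scaler whose statistics accumulate over ALL samples
    ever seen, so they change very slowly once many episodes are in.
    (Illustrative sketch, not the repo's Scaler implementation.)"""

    def __init__(self, dim):
        self.n = 0                   # total samples seen so far
        self.mean = np.zeros(dim)    # running per-feature mean
        self.m2 = np.zeros(dim)      # sum of squared deviations (Welford)

    def update(self, x):
        # x: (batch, dim) array of new observations
        for row in x:
            self.n += 1
            delta = row - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (row - self.mean)

    def scale(self, x):
        # normalize with the cumulative statistics, not the batch's own
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + 1e-8
        return (x - self.mean) / std
```

Because the statistics are cumulative, after a million samples a single new batch barely moves them, whereas vanilla batch norm renormalizes using only the current batch's statistics.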
Also, reinforcement learning agents are not trained on a stationary distribution: as an agent learns, it explores new areas and stops visiting others.
I wish I had documented it, but I did try normalization on a per-batch (or even per-10-batch) basis, and the learning performance was awful. All that said, I didn't implement actual batch normalization, so I don't want to discourage you from trying it.
Interesting, thanks for the explanation. I wonder if you could achieve the right level of flexibility with the TF BatchNorm layer's rolling average by adjusting the momentum argument ... maybe I'll give it a shot. Thanks again.
@pender
That'd be cool if you gave it a shot. Looking at my notes, when I tried normalization I did it per feature. I'm curious how true batch norm does.
Hi @pat-coady -- I was wondering why you use a custom Scaler python object instead of standard batchnorm (e.g. the tensorflow kind)? Wouldn't sticking a batchnorm layer onto the front of the policy net achieve the same thing, require less code and be compatible with TF savers? Sorry if I am misunderstanding!