pat-coady / trpo

Trust Region Policy Optimization with TensorFlow and OpenAI Gym
https://learningai.io/projects/2017/07/28/ai-gym-workout.html
MIT License

Scaler vs. BatchNorm #10

Closed: pender closed this issue 6 years ago

pender commented 6 years ago

Hi @pat-coady -- I was wondering why you use a custom Scaler python object instead of standard batchnorm (e.g. the tensorflow kind)? Wouldn't sticking a batchnorm layer onto the front of the policy net achieve the same thing, require less code and be compatible with TF savers? Sorry if I am misunderstanding!

pat-coady commented 6 years ago

@pender Good question. The idea of these scalers is that they change very slowly from batch to batch, and by 1M episodes they hardly budge at all from episode to episode. If you learn that a certain observation + action vector leads to poor rewards, you don't want batch norm to scale that vector to a different spot in future episodes depending on what else happens to be in the batch.

Also, reinforcement learning agents are not trained on a stationary distribution. As the agent learns, it explores different areas of the state space and stops visiting others.

I wish I had documented it, but I did try normalization on a per-batch or even per-10-batch basis, and the learning performance was awful. All that said, I didn't implement batch normalization itself, so I don't want to discourage you from trying it.
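As an illustration of the slowly-changing scaler described above, here is a minimal sketch of a per-feature running mean/variance normalizer. It is not the repo's actual Scaler code; the class name, method names, and epsilon constant are placeholders chosen for this example:

```python
import numpy as np


class RunningScaler:
    """Per-feature running mean/variance over all data seen so far.

    Because the running count only grows, each new batch nudges the
    statistics less and less, so after many episodes the normalization
    is effectively frozen (unlike batch norm, which re-normalizes with
    each batch's own statistics).
    """

    def __init__(self, obs_dim):
        self.mean = np.zeros(obs_dim)
        self.var = np.zeros(obs_dim)
        self.n = 0

    def update(self, x):
        """x: (batch_size, obs_dim) array of new observations."""
        batch_mean = x.mean(axis=0)
        batch_var = x.var(axis=0)
        m = x.shape[0]
        if self.n == 0:
            self.mean, self.var, self.n = batch_mean, batch_var, m
            return
        total = self.n + m
        new_mean = (self.n * self.mean + m * batch_mean) / total
        # Pool variances via E[x^2] - (E[x])^2 over the combined data.
        new_var = ((self.n * (self.var + self.mean ** 2) +
                    m * (batch_var + batch_mean ** 2)) / total
                   - new_mean ** 2)
        self.mean = new_mean
        self.var = np.maximum(new_var, 0.0)  # guard against tiny negatives
        self.n = total

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + 1e-6)
```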

pender commented 6 years ago

Interesting, thanks for the explanation. I wonder if you could achieve the right level of flexibility in the rolling average of the TF BatchNorm layer by adjusting the momentum argument ... maybe I'll give it a shot. Thanks again.
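For reference, a rough sketch of that idea using the Keras BatchNormalization layer. The network sizes, momentum value, and layer choices here are illustrative assumptions, not anything from the repo:

```python
import tensorflow as tf

obs_dim, act_dim = 17, 6  # placeholder sizes, not from the repo

policy_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(obs_dim,)),
    # momentum close to 1.0 makes the moving mean/variance drift slowly,
    # loosely mimicking the slowly-changing scaler described above
    tf.keras.layers.BatchNormalization(momentum=0.999),
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(act_dim),
])

# Caveat: in training mode, Keras BatchNormalization still normalizes each
# batch with that batch's own statistics; the slow-moving averages are only
# used when the layer is called with training=False.
```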

pat-coady commented 6 years ago

@pender

That'd be cool if you gave it a shot. Looking at my notes, when I tried it I normalized per feature. I'm curious how true batch norm would do.