ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

VecNormalize only for 1-D observations #115

Open timmeinhardt opened 6 years ago

timmeinhardt commented 6 years ago

Is there a particular reason why VecNormalize is only applied to 1-D observations? If so, wouldn't it make sense to at least apply the reward normalization? https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/47ddcbfab806c37ed19f438100300bd4d58c42f3/main.py#L68-L69
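
For illustration, a minimal sketch of what I mean, assuming a Baselines-style `VecNormalize(venv, ob=..., ret=...)` wrapper (the import path and keyword names here are an assumption, not necessarily what this repo currently exposes):

```python
# Sketch: keep reward normalization for image observations, and only
# normalize the observations themselves in the low-dimensional case.
from baselines.common.vec_env.vec_normalize import VecNormalize

def maybe_wrap(venv):
    obs_shape = venv.observation_space.shape
    if len(obs_shape) == 1:
        # 1-D (e.g. MuJoCo state) observations: normalize obs and rewards.
        return VecNormalize(venv, ob=True, ret=True)
    # Image observations: skip obs normalization, keep reward normalization.
    return VecNormalize(venv, ob=False, ret=True)
```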

ikostrikov commented 6 years ago

In my experience it made things much worse for most of the Atari games (both observation normalization and reward normalization): it helps for Pong but completely ruins Breakout.

But I didn't carefully tune the hyperparameters.

I would add a flag.

codeislife99 commented 6 years ago

Same here; I always comment out the normalization while training.

ikostrikov commented 6 years ago

I was going to change the normalization to the scheme from this paper: https://arxiv.org/pdf/1808.04355.pdf

Observation normalization. We run a random agent on our target environment for 10000 steps, then calculate the mean and standard deviation of the observation and use these to normalize the observations when training. This is useful to ensure that the features do not have very small variance at initialization and to have less variation across different environments.
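
In code, that scheme would look roughly like this. A sketch only: the helper names and the clipping value are mine, the 10000-step budget comes from the quote above.

```python
import numpy as np

def random_agent_obs_stats(venv, num_steps=10000):
    """Roll out a random policy and estimate observation mean/std."""
    collected = []
    obs = venv.reset()
    for _ in range(num_steps):
        # Sample one random action per parallel environment.
        actions = np.stack([venv.action_space.sample() for _ in range(venv.num_envs)])
        obs, _, _, _ = venv.step(actions)
        collected.append(obs)
    # (For large image observations you would accumulate running sums
    # instead of buffering everything; this is just a sketch.)
    collected = np.concatenate(collected, axis=0).astype(np.float64)
    return collected.mean(axis=0), collected.std(axis=0) + 1e-8

def normalize_obs(obs, mean, std, clip=10.0):
    # In their setup the statistics stay fixed during training.
    return np.clip((obs - mean) / std, -clip, clip)
```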

timmeinhardt commented 6 years ago

This is actually an interesting idea for normalising the observations. The resulting mean and standard deviation would then be used as starting points for the running mean and standard deviation, right? Because if they are fixed, the normalisation would not be robust against new observations that result from a better agent exploring new regions of the environment.
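
In code, what I am asking about would look roughly like this, using the Baselines `RunningMeanStd` class as an example (the placeholder values stand in for the random-agent estimates; the actual class used here may differ):

```python
import numpy as np
from baselines.common.running_mean_std import RunningMeanStd

# Placeholders standing in for the statistics from the random-agent rollout.
obs_shape = (4,)
precomputed_mean = np.zeros(obs_shape)
precomputed_std = np.ones(obs_shape)
num_random_steps = 10000

# Seed the running statistics with the precomputed estimates ...
ob_rms = RunningMeanStd(shape=obs_shape)
ob_rms.mean = precomputed_mean
ob_rms.var = precomputed_std ** 2
ob_rms.count = num_random_steps

# ... and then keep calling ob_rms.update(batch_of_obs) during training,
# so the statistics can track new observations as the agent explores.
```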

PS: I will submit a PR that adds two flags (observation normalisation on/off and reward normalisation on/off) and resolves issue #87.
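
Something along these lines, as a sketch (the flag names and the `ob`/`ret` keywords are placeholders, not the final interface of the PR):

```python
import argparse
from baselines.common.vec_env.vec_normalize import VecNormalize

parser = argparse.ArgumentParser()
# Hypothetical flag names; the PR may name them differently.
parser.add_argument('--no-obs-norm', action='store_true',
                    help='disable observation normalization')
parser.add_argument('--no-reward-norm', action='store_true',
                    help='disable reward normalization')
args = parser.parse_args()

def wrap_with_normalization(venv):
    # Assumes a Baselines-style VecNormalize with independent ob/ret toggles.
    if args.no_obs_norm and args.no_reward_norm:
        return venv
    return VecNormalize(venv, ob=not args.no_obs_norm, ret=not args.no_reward_norm)
```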

ikostrikov commented 6 years ago

I think they don't update it during training.

I think it's sufficient, and at the same time it reduces the variance of the gradient updates since the normalization is fixed.

Sounds good!