ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

Bad mask is bad #182

Open NTT123 opened 5 years ago

NTT123 commented 5 years ago

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/d4c68c1ce8d87a7c4ae7a7d88b54daa197785fc4/a2c_ppo_acktr/storage.py#L82-L84

Sorry if I am wrong about this. When the time limit is reached, the env resets its observation, so the transition (s, r, s') is a bad transition: s' is the first observation of a new episode, not the successor of s. The current implementation forces a bad transition to have zero advantage (before normalization) and zero loss on the value-function prediction. I would argue this biases the gradients, because we have no information indicating that the advantage or the value loss is actually zero. A correct implementation, imho, should remove these bad transitions.
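For context, the linked lines apply bad_masks inside the GAE loop. Paraphrased (a sketch of the idea, not an exact copy of storage.py):

```python
import torch

def compute_returns_gae(rewards, value_preds, masks, bad_masks,
                        gamma=0.99, gae_lambda=0.95):
    # Shapes follow the rollout storage: rewards [T, N, 1], value_preds [T+1, N, 1],
    # masks/bad_masks [T+1, N, 1]. masks[t+1] == 0 means the episode ended at step t;
    # bad_masks[t+1] == 0 means it ended only because the time limit was hit.
    returns = torch.zeros_like(value_preds)
    gae = 0
    for step in reversed(range(rewards.size(0))):
        delta = (rewards[step]
                 + gamma * value_preds[step + 1] * masks[step + 1]
                 - value_preds[step])
        gae = delta + gamma * gae_lambda * masks[step + 1] * gae
        # This is the line the issue is about: zeroing gae on a time-limit step
        # forces advantage = 0 and returns[step] = value_preds[step], i.e. zero
        # value loss, even though we have no evidence that either is true.
        gae = gae * bad_masks[step + 1]
        returns[step] = gae + value_preds[step]
    return returns
```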

ikostrikov commented 5 years ago

I agree, it may well introduce some bias into the gradients. I will accept a PR that removes these transitions from sampling.
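Such a PR could, for example, build the minibatch indices only from steps whose bad_masks entries are 1. A hypothetical sketch (the flattening order and the valid_step_indices helper are illustrative assumptions, not a concrete patch):

```python
import torch

def valid_step_indices(bad_masks, num_steps, num_processes):
    # bad_masks: [T + 1, N, 1]; step t was truncated by the time limit when
    # bad_masks[t + 1] == 0. Return flat indices into the T*N flattened rollout
    # that are safe to sample from.
    flat = bad_masks[1:].reshape(num_steps * num_processes)
    return torch.nonzero(flat, as_tuple=False).squeeze(-1)

# Illustrative use inside a feed-forward sampler:
# idx = valid_step_indices(rollouts.bad_masks, num_steps, num_processes)
# idx = idx[torch.randperm(idx.numel())]  # shuffle, then slice into minibatches
```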

MarcoMeter commented 4 years ago

It is considered a bad transition because the new state is the initial one of a completely new episode, correct?

ikostrikov commented 4 years ago

@mabirck yes, that's correct.

The correct way to fix it would be to modify the Vec wrappers so that envs are not reset automatically and the terminal state is not overwritten, and then add something like env.maybe_reset() that checks and resets the envs if necessary.
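Sketched on a single gym env for brevity (the real change would live in the VecEnv wrappers, and maybe_reset is the hypothetical helper mentioned above, not an existing API):

```python
import gym

class DeferredResetSketch(gym.Wrapper):
    """Do not reset automatically on done; keep the true terminal observation
    and let the caller decide when to reset."""

    def __init__(self, env):
        super().__init__(env)
        self._needs_reset = True

    def reset(self, **kwargs):
        self._needs_reset = False
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            # Keep the terminal observation instead of overwriting it with the
            # first observation of a new episode.
            self._needs_reset = True
        return obs, reward, done, info

    def maybe_reset(self, **kwargs):
        # Reset only if the previous episode actually ended.
        if self._needs_reset:
            return self.reset(**kwargs)
        return None
```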

xkianteb commented 3 years ago

This issue is 100% evident in Hopper. There are a few ways to fix it: [1] remove the bad transitions from sampling, or [2] handle the final transition explicitly. For [2] we can either treat the final transition like any other terminal transition (e.g. the value target for the last state equals the final reward), or account for the fact that we do not know what would have happened had the episode not been terminated. In the latter case, we set the advantage for the final state to zero and its value target to the current value function. See https://github.com/fabiopardo/tonic/blob/48a7b72757f7629c8577eb062373854a3663c255/tonic/replays/utils.py#L1 and the "Handling of timesteps" discussion in https://arxiv.org/pdf/2006.05990.pdf.
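To make the two treatments in [2] concrete, a per-step sketch (the function name and signature are mine, purely for illustration):

```python
import torch

def timeout_step_targets(reward, value_pred, treat_as_terminal):
    """Return (advantage, value_target) for a transition truncated by the time limit."""
    if treat_as_terminal:
        # [2a] Treat it like any other terminal transition: the value target of
        # the last state is just the final reward.
        value_target = reward
        advantage = value_target - value_pred
    else:
        # [2b] We do not know what would have happened after the cutoff, so set
        # the advantage to zero and the value target to the current value
        # estimate (effectively what the bad_masks line in storage.py does).
        value_target = value_pred
        advantage = torch.zeros_like(value_pred)
    return advantage, value_target
```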

xkianteb commented 3 years ago

You can also just add a TimeFeatureWrapper:
https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/master/sb3_contrib/common/wrappers/time_feature.py
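The idea is simply to append the normalized remaining time to the observation, so the value function can see a timeout coming instead of being surprised by it. A minimal sketch using the old gym API (not the sb3-contrib implementation):

```python
import gym
import numpy as np

class TimeFeatureSketch(gym.Wrapper):
    """Append the normalized remaining time to a Box observation."""

    def __init__(self, env, max_steps=1000):
        super().__init__(env)
        self._max_steps = max_steps
        self._t = 0
        low = np.append(self.observation_space.low, 0.0).astype(np.float32)
        high = np.append(self.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._t = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        self._t += 1
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        remaining = 1.0 - self._t / self._max_steps
        return np.append(obs, remaining).astype(np.float32)
```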