uvipen / Super-mario-bros-PPO-pytorch

Proximal Policy Optimization (PPO) algorithm for Super Mario Bros
MIT License

Fatal bug in implementation of GAE #21

Open SupernovaTitanium opened 1 year ago

SupernovaTitanium commented 1 year ago

https://github.com/uvipen/Super-mario-bros-PPO-pytorch/blob/ab4248d715346c6adc33c2157455e2b98c130bcc/train.py#L119 It should be

gae = gae * opt.gamma * opt.tau * (1 - done)

Suppose worker 1 has to sample 500 steps but the game ends prematurely at step 250. The worker restarts the game and samples the remaining 250 steps, so the trajectory is s1, s2, ..., s250, s1', s2', ..., s250'. The current implementation forgets to reset the accumulated GAE to zero at the episode boundary, so when the advantage of s250 is computed in the backward pass it also absorbs the GAE of the following episode. This makes the advantage of s250 larger than it should be, and it keeps growing, which teaches the network that it should output a250 whenever it sees s250 (even though taking a250 in s250 is exactly what kills the agent).

As a result, the critic loss diverges (the advantage keeps growing and the value network can never predict it correctly), the policy gets stuck on the action that kills it, and the agent does not learn anything.
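For reference, here is a minimal self-contained sketch of the backward GAE pass with the termination mask applied. The gamma/tau values mirror opt.gamma/opt.tau in the repo; the function name, array layout, and everything else are just for illustration, not the repo's exact code:

```python
import numpy as np

def compute_gae(rewards, values, next_value, dones, gamma=0.9, tau=1.0):
    """rewards, values, dones: arrays of length T; next_value: bootstrap value after step T.

    Illustrative sketch, not the repo's code.
    """
    gae = 0.0
    returns = np.zeros_like(rewards, dtype=np.float64)
    for t in reversed(range(len(rewards))):
        # Zero the bootstrap term AND the accumulated gae at terminal steps,
        # so the advantage of the next episode cannot leak into this one.
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * tau * gae * mask
        returns[t] = gae + values[t]
        next_value = values[t]
    return returns
```

The key line is `gae = delta + gamma * tau * gae * mask`: without the mask, the gae accumulated from s1' onward gets added to the advantage of s250, which is the bug described above.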