nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

update() retains discounted_reward from previous episodes #8

Closed BigBadBurrow closed 4 years ago

BigBadBurrow commented 4 years ago

In the update() method, discounted_reward is accumulated by repeatedly applying gamma to the previous discounted_reward, but there is no break between episodes, so the return from one episode is carried over into the next, which I assume cannot be correct.

I suggest adding a terminal_states list to the Memory class and setting discounted_reward = 0 whenever a new episode starts.
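
[Editor's note] As a rough sketch of this suggestion: the buffer below stores a per-step done flag alongside each reward. The field names (is_terminals, clear_memory) are assumptions for illustration and may not match the repository exactly.

    class Memory:
        def __init__(self):
            self.states = []
            self.actions = []
            self.logprobs = []
            self.rewards = []
            self.is_terminals = []   # done flag per step, marks episode boundaries

        def clear_memory(self):
            del self.states[:]
            del self.actions[:]
            del self.logprobs[:]
            del self.rewards[:]
            del self.is_terminals[:]

During rollout collection, the training loop would then append the environment's done flag for every step (e.g. memory.is_terminals.append(done)) so that update() can tell where each episode ends.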

BigBadBurrow commented 4 years ago

Remember, the list is iterated in reverse, so I think you'd need to reset discounted_reward = 0 first:

            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + (self.gamma * discounted_reward)
            rewards.insert(0, discounted_reward)
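
[Editor's note] For completeness, a self-contained sketch of the corrected pass over the buffer. compute_returns is a hypothetical helper, not a function from the repository; it computes the Monte Carlo return for each step while resetting the running return at episode boundaries.

    def compute_returns(rewards, is_terminals, gamma=0.99):
        # Iterate the rollout in reverse; reset the running return whenever
        # the step that follows (in forward order) starts a new episode.
        returns = []
        discounted_reward = 0
        for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + gamma * discounted_reward
            returns.insert(0, discounted_reward)
        return returns

    # Example: two episodes of length 2 with gamma = 0.5.
    # compute_returns([1, 1, 1, 1], [False, True, False, True], 0.5)
    # -> [1.5, 1.0, 1.5, 1.0]  (the return no longer leaks across episodes)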
nikhilbarhate99 commented 4 years ago

Thanks!