nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch

Including GAE #26

Closed · CesMak closed this issue 4 years ago

CesMak commented 4 years ago

Hey there,

you used a Monte Carlo estimate - would it not also be nice to have GAE (Generalized Advantage Estimation)?

The function should be something like the snippet below, but I am not sure how exactly to include GAE in the rest of the code:

    import numpy as np
    import torch

    def get_advantages(self, values, masks, rewards, gamma, lam=0.95):
        # values must hold one extra bootstrap entry: len(values) == len(rewards) + 1
        # masks[i] is 0.0 where the episode terminated at step i, else 1.0
        returns = []
        gae = 0
        for i in reversed(range(len(rewards))):
            delta = rewards[i] + gamma * values[i + 1] * masks[i] - values[i]
            gae = delta + gamma * lam * masks[i] * gae
            returns.insert(0, gae + values[i])

        # drop the bootstrap value when computing the advantages
        adv = np.array(returns) - values[:-1].detach().numpy()
        adv = torch.tensor(adv.astype(np.float32)).float()
        # Normalizing advantages
        return returns, (adv - adv.mean()) / (adv.std() + 1e-5)
For comparison, this is the existing Monte Carlo estimate of the returns in this repo:

        # Monte Carlo estimate of rewards:
        rewards = []
        discounted_reward = 0
        for reward, is_terminal in zip(reversed(memory.rewards), reversed(memory.is_terminals)):
            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + (self.gamma * discounted_reward)
            rewards.insert(0, discounted_reward)
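In case it is useful, here is a minimal usage sketch of get_advantages, assuming the imports and the function above (purely illustrative: the rollout length, rewards, masks, and critic values are random placeholders, and passing self=None only works because the method never touches self):

    T = 5                                  # rollout length (placeholder)
    rewards = [1.0] * T                    # rewards collected during the rollout
    masks = [1.0, 1.0, 1.0, 1.0, 0.0]      # 0.0 where the episode terminated
    values = torch.rand(T + 1)             # critic values V(s_0)..V(s_T), incl. one bootstrap entry

    returns, advantages = get_advantages(None, values, masks, rewards, gamma=0.99)

    # The normalized `advantages` would then feed the clipped surrogate loss in
    # place of the Monte Carlo based advantages, and the critic would be
    # regressed onto `returns` instead of the discounted rewards.
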
nikhilbarhate99 commented 4 years ago

I do not think the added complexity is worth it in this repo, since the goal is to provide a simple and beginner-friendly implementation.

Also, bootstrapping values does require the experience to be collected from parallel workers in order to work practically. Source: skip to 54:19 of https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=6
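For context, the distinction here is between the pure Monte Carlo return used as the target in this repo (no bootstrapping) and the GAE estimator, which bootstraps on the learned value function (standard definitions, nothing repo-specific):

    G_t = \sum_{k=0}^{T-1-t} \gamma^k r_{t+k}                                 % Monte Carlo return, no bootstrap

    \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
    \hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{T-1-t} (\gamma\lambda)^l \delta_{t+l}   % bootstraps on V
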

I do not know how GAE will perform on a single worker and I also do not have the time to re-test GAE on different environments.

If your implementation works correctly, you can add it to your fork of this repo.