About the advantage values in PPO2

sweetice / Deep-reinforcement-learning-with-pytorch

PyTorch implementation of DQN, AC, ACER, A2C, A3C, PG, DDPG, TRPO, PPO, SAC, TD3 and ....

MIT License


Open Hardlygo opened 3 years ago

Hardlygo commented 3 years ago

I think the advantage value here should be based on the old actor:

`target_v = reward + args.gamma * self.critic_net(next_state)`
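One possible reading of this point is that the advantages should be computed once, from the networks as they were before the update epochs start, rather than from a critic that has already been modified inside the update loop. The sketch below illustrates that idea without any framework. `LinearCritic`, `one_step_advantages`, and the toy data are hypothetical and are not taken from the repository. In PyTorch, the rough equivalent would be computing the targets under `torch.no_grad()` before the PPO epochs begin.

```python
from typing import Callable, List, Sequence


def one_step_advantages(
    rewards: Sequence[float],
    states: Sequence[float],
    next_states: Sequence[float],
    value_fn: Callable[[float], float],
    gamma: float,
) -> List[float]:
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return [
        r + gamma * value_fn(s_next) - value_fn(s)
        for r, s, s_next in zip(rewards, states, next_states)
    ]


class LinearCritic:
    """Hypothetical stand-in for a critic network: V(s) = w * s."""

    def __init__(self, w: float) -> None:
        self.w = w

    def __call__(self, s: float) -> float:
        return self.w * s

    def step(self, s: float, target: float, lr: float) -> None:
        # One gradient step on 0.5 * (V(s) - target)^2 with respect to w.
        self.w -= lr * (self.w * s - target) * s


# Toy rollout data (assumed values, for illustration only).
rewards = [1.0, 0.0, 1.0]
states = [1.0, 2.0, 3.0]
next_states = [2.0, 3.0, 4.0]
gamma = 0.9

critic = LinearCritic(w=0.5)

# Compute advantages and value targets ONCE, from the pre-update critic.
fixed_adv = one_step_advantages(rewards, states, next_states, critic, gamma)
fixed_targets = [a + critic(s) for a, s in zip(fixed_adv, states)]

# Several update epochs reuse the frozen targets; the critic itself changes.
for _ in range(3):
    for s, target in zip(states, fixed_targets):
        critic.step(s, target, lr=0.01)

# Recomputing with the updated critic gives different advantages,
# which is the drift the frozen values avoid.
recomputed_adv = one_step_advantages(rewards, states, next_states, critic, gamma)
```

Comparing `fixed_adv` with `recomputed_adv` shows how much the advantage estimates drift once the critic has been updated, which is why the order of these computations matters.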