sweetice / Deep-reinforcement-learning-with-pytorch

PyTorch implementation of DQN, AC, ACER, A2C, A3C, PG, DDPG, TRPO, PPO, SAC, TD3 and ....
MIT License

About PPO #24

Open LpLegend opened 3 years ago

LpLegend commented 3 years ago

I don't think this code can solve the problem (Pendulum). Another question: why is the reward computed as `running_reward * 0.9 + score * 0.1`?
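[Editor's note] On the second question: that expression looks like an exponential moving average of episode returns, usually kept only for logging or deciding when the task counts as solved, not for the PPO update itself. A minimal sketch under that assumption (variable names are taken from the question, not from the repository):

```python
def update_running_reward(running_reward: float, score: float,
                          decay: float = 0.9) -> float:
    """Exponential moving average of episode returns.

    Typically used only for logging / an early-stopping check; it does not
    enter the PPO loss. (Sketch; names are assumptions, not repo code.)
    """
    return running_reward * decay + score * (1.0 - decay)


# Example: smoothing a noisy sequence of episode scores.
running = -1200.0                      # pessimistic start for Pendulum
for score in [-1100.0, -900.0, -600.0, -300.0]:
    running = update_running_reward(running, score)
    print(f"score={score:7.1f}  running_reward={running:8.2f}")
```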

LpLegend commented 3 years ago

I have changed the activation function from ReLU to tanh, but there is no improvement.
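[Editor's note] For anyone reproducing this, the change described is swapping the hidden-layer nonlinearity of the actor from ReLU to tanh. A minimal sketch of such an actor for Pendulum (layer widths and the fixed log-std parameter are assumptions, not taken from the repository):

```python
import torch
import torch.nn as nn


class TanhActor(nn.Module):
    """Gaussian policy with tanh hidden activations (sketch only)."""

    def __init__(self, state_dim: int = 3, action_dim: int = 1, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),   # was nn.ReLU()
            nn.Linear(hidden, hidden), nn.Tanh(),      # was nn.ReLU()
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor):
        mean = self.net(state)
        return mean, self.log_std.exp()


actor = TanhActor()
mean, std = actor(torch.randn(1, 3))   # Pendulum observations have 3 dims
```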

heyfavour commented 3 years ago

I don't think this code can solve the problem (Pendulum). Another question: why is the reward computed as `running_reward * 0.9 + score * 0.1`?

I ran into this problem too. I asked the author of elegantrl, and he said that applying tanh first and then sampling the action through torch.distributions affects the entropy, so it cannot converge. But I don't like elegantrl's PPO implementation, so I'm still looking for someone else's code.
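[Editor's note] One way to read that advice: if tanh is applied to the network output and the action is then sampled from a Normal built around it, the distribution's log-probability and entropy no longer describe the (clipped/scaled) action the environment actually receives. A common alternative, used in SAC-style implementations rather than necessarily in this repository, is to sample first and squash the sample with tanh, letting torch.distributions correct the log-probability. A minimal sketch (names and sizes are assumptions):

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

# Sample from a Normal, then squash the *sample* with tanh, so log_prob
# includes the tanh Jacobian correction.
mean = torch.zeros(1, requires_grad=True)
log_std = torch.zeros(1, requires_grad=True)

base = Normal(mean, log_std.exp())
dist = TransformedDistribution(base, [TanhTransform(cache_size=1)])

action = dist.rsample()            # differentiable sample in (-1, 1)
log_prob = dist.log_prob(action)   # corrected for the tanh squashing

# Pendulum expects torques in [-2, 2]; rescale outside the distribution so
# the log-probabilities stay consistent (up to a constant log(2) per dim).
env_action = 2.0 * action.detach()

# Note: dist.entropy() has no analytic form for this squashed distribution,
# so entropy bonuses are typically estimated as -log_prob.
```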

CoulsonZhao commented 3 years ago

Have you found the right code yet? Could you post a link? Much appreciated!!

huang-chunyang commented 2 months ago

I don't think this code can solve the problem (Pendulum). Another question: why is the reward computed as `running_reward * 0.9 + score * 0.1`?

You can change clip_param from 0.2 to 0.1, which constrains the trust region more tightly. This works!
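[Editor's note] clip_param is the epsilon in PPO's clipped surrogate objective, so shrinking it from 0.2 to 0.1 limits how far the probability ratio may move from 1 in each update. A minimal sketch of the objective this parameter controls (generic PPO, not the exact code in this repository):

```python
import torch


def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantage: torch.Tensor,
                  clip_param: float = 0.1) -> torch.Tensor:
    """Clipped surrogate policy loss; clip_param is PPO's epsilon.

    A smaller clip_param keeps the ratio pi_new/pi_old closer to 1,
    i.e. a tighter trust region per update.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)        # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage
    return -torch.min(unclipped, clipped).mean()          # minimize = maximize surrogate


# Toy usage with random tensors just to show the call shape.
lp_new = torch.randn(32, requires_grad=True)
lp_old = lp_new.detach() + 0.05 * torch.randn(32)
adv = torch.randn(32)
loss = ppo_clip_loss(lp_new, lp_old, adv, clip_param=0.1)
loss.backward()
```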