thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

Same episodic returns every epoch #667

Closed · c4cld closed this issue 2 years ago

c4cld commented 2 years ago

Hello, I am using PPO with a customized environment to train an agent. However, I found that the episodic returns are the same every epoch. I'm not sure what's going wrong...

Here's the output:

Epoch #1: 2001it [00:36, 54.27it/s, env_step=2000, len=1001, loss/actor=11.228, loss/critic1=0.062, loss/critic2=0.064, n/ep=0, n/st=1, rew=-943.40]
Epoch #1: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #2: 2001it [00:38, 52.39it/s, env_step=4000, len=2000, loss/actor=19.322, loss/critic1=0.155, loss/critic2=0.158, n/ep=0, n/st=1, rew=-1884.80]
Epoch #2: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #3: 2001it [00:39, 51.08it/s, env_step=6000, len=2000, loss/actor=26.462, loss/critic1=0.356, loss/critic2=0.359, n/ep=0, n/st=1, rew=-1884.80]
Epoch #3: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #4: 2001it [00:37, 52.73it/s, env_step=8000, len=2000, loss/actor=32.905, loss/critic1=0.821, loss/critic2=0.835, n/ep=0, n/st=1, rew=-1884.80]
Epoch #4: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #5: 2001it [00:37, 53.59it/s, env_step=10000, len=2000, loss/actor=38.699, loss/critic1=0.514, loss/critic2=0.513, n/ep=0, n/st=1, rew=-1884.80]
Epoch #5: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #6: 2001it [00:37, 53.61it/s, env_step=12000, len=2000, loss/actor=43.835, loss/critic1=0.993, loss/critic2=0.992, n/ep=0, n/st=1, rew=-1884.80]
Epoch #6: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #7: 2001it [00:37, 53.69it/s, env_step=14000, len=2000, loss/actor=48.587, loss/critic1=1.452, loss/critic2=1.450, n/ep=0, n/st=1, rew=-1884.80]
Epoch #7: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #8: 2001it [00:37, 53.64it/s, env_step=16000, len=2000, loss/actor=52.685, loss/critic1=1.356, loss/critic2=1.346, n/ep=0, n/st=1, rew=-1884.80]
Epoch #8: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #9: 2001it [00:37, 53.65it/s, env_step=18000, len=2000, loss/actor=56.454, loss/critic1=1.917, loss/critic2=1.921, n/ep=0, n/st=1, rew=-1884.80]
Epoch #9: test_reward: -943.400000 ± 0.000000, best_reward: -943.400000 ± 0.000000 in #0
Epoch #10: 2001it [00:37, 53.78it/s, env_step=20000, len=2000, loss/actor=59.955, loss/critic1=2.231, loss/critic2=2.229, n/ep=0, n/st=1, rew=-1884.80]

Trinkle23897 commented 2 years ago

Have you tried logging the result of each env step (obs, rew, done, info)?
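
A minimal sketch of that kind of check, assuming a Gym-style environment with the classic 4-tuple step API; MyCustomEnv is a placeholder for the customized environment, not code from this issue:

# Minimal sketch: step the environment by hand with random actions and print
# what comes back, so constant rewards or never-ending episodes show up early.
# `MyCustomEnv` is a placeholder for the user's customized environment.
env = MyCustomEnv()
obs = env.reset()
for t in range(10):
    act = env.action_space.sample()          # random action, just for debugging
    obs, rew, done, info = env.step(act)
    print(f"t={t} act={act} rew={rew} done={done} info={info}")
    if done:
        obs = env.reset()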

c4cld commented 2 years ago

Thanks for your answer! I just found the bug: the actor net outputs the same action every time.
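
For reference, a rough sketch of how such a symptom can be confirmed: feed several distinct observations through the actor network and look at the spread of its outputs. This assumes the actor is a plain PyTorch module; actor_net and obs_dim are placeholders for the user's actor module and observation size, not names from this issue.

# Rough sketch: if the per-dimension std of the outputs over a batch of
# different observations is (nearly) zero, the network maps every observation
# to the same action, e.g. due to a saturated final activation or weights that
# never receive a useful gradient.
import torch

actor_net.eval()
with torch.no_grad():
    obs_batch = torch.randn(32, obs_dim)      # or a batch of real observations
    out = actor_net(obs_batch)
    if isinstance(out, tuple):                # some networks return (output, hidden_state)
        out = out[0]
    print("per-dimension std over the batch:", out.std(dim=0))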