vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Performance compared with SB3 #405

Closed: qiuruiyu closed this issue 11 months ago

qiuruiyu commented 1 year ago

Problem Description

I found that, on my own customized environment based on Gymnasium, training converges well with stable-baselines3, reaching a final reward of around 17.9. With CleanRL, however, the reward only gets to about 20 and stops improving there. This really confuses me, and the difference in reward means the controller trained with RL does not perform well in my evaluation.

Possible Solution

Is there a problem with the trade-off between exploitation and exploration? Or with how the action standard deviation is set?
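
For reference, CleanRL's `ppo_continuous_action.py` parameterizes exploration with a state-independent log standard deviation that is learned as a free parameter and initialized to zero (std = 1), so that is the relevant knob if the question is about the action std. A minimal sketch of such an actor head (layer sizes and names are illustrative, not copied from the source):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Actor(nn.Module):
    """Gaussian policy head in the style of CleanRL's ppo_continuous_action.py."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # State-independent log std, learned directly and initialized to 0
        # (i.e. std = 1). If exploration looks too wide or too narrow for
        # your env, this is the parameter to inspect.
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action(self, obs: torch.Tensor):
        mean = self.mean_net(obs)
        std = torch.exp(self.actor_logstd.expand_as(mean))
        dist = Normal(mean, std)
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1)
```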

qiuruiyu commented 1 year ago

Update: when I only save the model during training, the reward looks right, but when I add evaluation, the reward gets stuck. It's really confusing.
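
One thing worth checking here: whether the evaluation rollouts sample from the stochastic policy or use the deterministic mean action. SB3's `model.predict(obs, deterministic=True)` takes the mean; a CleanRL-style script has to do that explicitly. A hedged sketch of a deterministic evaluation loop, assuming a Gymnasium env and the illustrative `Actor` above:

```python
import torch

@torch.no_grad()
def evaluate(actor, env, num_episodes: int = 10) -> float:
    """Average episodic return using the deterministic mean action."""
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            # Use the policy mean rather than sampling from the distribution.
            action = actor.mean_net(obs_t).squeeze(0).numpy()
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += float(reward)
            done = terminated or truncated
        returns.append(ep_return)
    return sum(returns) / len(returns)
```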

vwxyzjn commented 11 months ago

I am afraid I can't help too much there. SB3's PPO is slightly different. See


https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
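
For one concrete example of the kind of gap that post documents: by default SB3's PPO keeps the learning rate constant and does not clip the value loss, while CleanRL's PPO anneals the learning rate and clips the value loss. A hedged sketch of SB3 settings that move it closer to CleanRL's continuous-action defaults (the env id is a placeholder, and exact equivalence is not guaranteed):

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder env id; substitute your own env
    # SB3 schedules receive progress_remaining, which goes 1 -> 0,
    # so this anneals the LR linearly like CleanRL's --anneal-lr.
    learning_rate=lambda progress_remaining: 3e-4 * progress_remaining,
    n_steps=2048,       # rollout length, like CleanRL's --num-steps
    batch_size=64,      # minibatch size
    n_epochs=10,        # update epochs per rollout
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    clip_range_vf=0.2,  # CleanRL clips the value loss; SB3 does not by default
    ent_coef=0.0,
    vf_coef=0.5,
    max_grad_norm=0.5,
)
model.learn(total_timesteps=1_000_000)
```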

Good luck!