vwxyzjn / cleanrl

High-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

SAC discrete #266

Closed: timoklein closed this issue 10 months ago

timoklein commented 2 years ago

Hey there!

I've used this repo's SAC code as a starting point for an implementation of SAC-discrete (paper) for a project of mine. If you're interested, I'd be willing to contribute it to CleanRL.

The differences from SAC for continuous action spaces aren't too big, and I can start from a working implementation, so this shouldn't take too long.
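For context, the core change in SAC-discrete is that the actor outputs a categorical distribution over actions and the critics output Q-values for all actions, so the soft value and policy objectives become exact expectations over the action set instead of relying on the reparameterization trick. A minimal sketch of the actor loss (assuming PyTorch and hypothetical tensor names; not CleanRL's actual code):

```python
import torch
import torch.nn.functional as F

def discrete_sac_actor_loss(logits, q1, q2, alpha):
    """Sketch of the SAC-discrete actor objective.

    logits: (B, n_actions) raw actor outputs
    q1, q2: (B, n_actions) per-action Q-values from the two critics
    alpha:  entropy temperature
    """
    log_probs = F.log_softmax(logits, dim=-1)   # log pi(a|s)
    probs = log_probs.exp()                     # pi(a|s)
    min_q = torch.min(q1, q2)                   # clipped double-Q
    # Exact expectation over actions -- no sampling or reparameterization:
    policy_loss = (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
    entropy = -(probs * log_probs).sum(dim=-1)  # per-state policy entropy
    return policy_loss, entropy

# Usage with random tensors (batch of 4 states, 6 discrete actions):
B, n_actions = 4, 6
logits = torch.randn(B, n_actions)
q1, q2 = torch.randn(B, n_actions), torch.randn(B, n_actions)
loss, ent = discrete_sac_actor_loss(logits, q1, q2, alpha=0.2)
```

The same expectation trick applies to the critic target, which averages the next-state Q-values under the policy rather than evaluating a sampled action.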

What do you think?


vwxyzjn commented 2 years ago

Hi @timoklein, thanks for your interest in contributing! SAC-discrete indeed sounds like an interesting addition to CleanRL. I just glanced at the paper and would recommend prototyping a sac_atari.py to work with Atari games, as done in the paper.

I was a bit surprised to see the algorithm performs poorly on Pong. Do you have any insight on this? Maybe this is some implementation details stuff... CC @dosssman, who was the main contributor to CleanRL's SAC implementation.

timoklein commented 2 years ago

> I just glanced at the paper and would recommend prototyping a sac_atari.py to work with Atari games, as done in the paper.

Getting to work on it!

> I was a bit surprised to see the algorithm performs poorly on Pong. Do you have any insight on this? Maybe this is some implementation details stuff... CC @dosssman, who was the main contributor to CleanRL's SAC implementation.

For reference, here are the reported results in the paper: (image: SAC-discrete results table from the paper)

In my opinion, the bad results on Pong are due to the evaluation scheme. Evaluating at 100k time steps on Atari is a very tough setting for "standard" model-free RL algorithms (some newer methods like CURL or DrQ may perform better); Rainbow also fails to improve over a random agent in this setting. We should therefore focus the evaluation on games where meaningful improvements over a random baseline can be made, e.g. Seaquest, James Bond, or Road Runner.

vwxyzjn commented 10 months ago

Closed by #270