openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

Discrete action version of SAC #148

Closed p-christ closed 4 years ago

p-christ commented 5 years ago

Hi,

In your SAC docs it says

An alternate version of SAC, which slightly changes the policy update rule, can be implemented to handle discrete action spaces.

I was wondering if you know of a good place to look to understand how SAC should work with discrete action spaces? Is there a paper or GitHub repo, for example, that goes through this?

Thanks!

zlw21gxy commented 4 years ago

  1. You can check out this thread: https://www.reddit.com/r/reinforcementlearning/comments/bmm1dj/soft_actorcritic_with_discrete_actions/
  2. Here is our implementation of SAC that can handle discrete action spaces: https://github.com/createamind/DRL/tree/master
janislavjankov commented 4 years ago

rlgraph has an implementation of SAC with discrete actions using the Gumbel-Softmax distribution: https://github.com/rlgraph/rlgraph/blob/master/rlgraph/agents/sac_agent.py
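
For reference, here is a minimal PyTorch sketch of the Gumbel-Softmax idea: it gives a differentiable, approximately one-hot action sample from a categorical policy, so the reparameterized actor update from continuous SAC carries over. The names and sizes (`policy_net`, `q_net`, `obs_dim`, etc.) are placeholders for illustration, not taken from rlgraph:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder sizes and networks, purely for illustration (not rlgraph's code).
obs_dim, n_actions, alpha = 8, 4, 0.2
policy_net = nn.Linear(obs_dim, n_actions)            # outputs action logits
q_net = nn.Linear(obs_dim + n_actions, 1)             # Q(s, one-hot a)

state = torch.randn(32, obs_dim)                      # batch of observations
logits = policy_net(state)

# Gumbel-Softmax gives a differentiable, (approximately) one-hot action sample,
# so the reparameterized actor update from continuous SAC carries over.
action = F.gumbel_softmax(logits, tau=1.0, hard=True) # straight-through one-hot sample
log_prob = (F.log_softmax(logits, dim=-1) * action).sum(dim=-1, keepdim=True)

# Actor objective, analogous to continuous SAC: minimize alpha * log pi(a|s) - Q(s, a).
q_value = q_net(torch.cat([state, action], dim=-1))
actor_loss = (alpha * log_prob - q_value).mean()
actor_loss.backward()                                 # gradients flow through the relaxed sample
```

With `hard=True` the forward pass uses a one-hot action while gradients flow through the relaxed softmax sample (straight-through estimator); the temperature `tau` trades off bias against gradient variance.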

p-christ commented 4 years ago

I've also created an implementation of it here: https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/master/agents/actor_critic_agents/SAC_Discrete.py

ac-93 commented 4 years ago

A little late to the party, but here are two more versions of SAC for discrete action spaces: https://github.com/ac-93/soft-actor-critic

sac_discrete_gb uses the Gumbel-Softmax to reparameterize the discrete distribution, as suggested here: https://stackoverflow.com/questions/56226133/soft-actor-critic-with-discrete-action-space. This implementation includes two types of Q network.

I've not yet been able to recreate the results shown for LunarLander; this could be due to a problem with the implementation.

sac_discrete_kl calculates the entropy and KL divergence as sums over the policy network outputs, without any reparameterization. I think this is more closely aligned with what was suggested and validated by the Spinning Up author here: https://www.reddit.com/r/reinforcementlearning/comments/bmm1dj/soft_actorcritic_with_discrete_actions/. Although I'm not 100% sure this implementation is correct, it would be great to get some feedback.
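
For anyone comparing the two approaches, a minimal PyTorch sketch of this "exact expectation" style: since the action set is finite, the policy objective and entropy can be computed as probability-weighted sums over all actions, with no sampling or reparameterization. The names (`policy_net`, `q_net`, etc.) are placeholders for illustration, not ac-93's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder sizes and networks, purely for illustration (not the sac_discrete_kl code).
obs_dim, n_actions, alpha = 8, 4, 0.2
policy_net = nn.Linear(obs_dim, n_actions)   # outputs action logits
q_net = nn.Linear(obs_dim, n_actions)        # outputs Q(s, a) for every discrete action at once

state = torch.randn(32, obs_dim)
logits = policy_net(state)
probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)

# With a finite action set, the expectations in the SAC policy objective can be
# computed exactly as probability-weighted sums over actions.
q_values = q_net(state)                                             # shape [batch, n_actions]
actor_loss = (probs * (alpha * log_probs - q_values.detach())).sum(dim=-1).mean()

# The policy entropy (used for the temperature/target updates) is likewise an exact sum.
entropy = -(probs * log_probs).sum(dim=-1)
```

This avoids the bias and variance of a relaxed sample entirely, at the cost of evaluating Q for every action, which is cheap when the action set is small.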

I've found the results to be mixed: underperforming standard DQN on some of the simple Gym environments but doing quite well on Atari games. I haven't done much testing though, so this could be due to problems in the implementation.