Soft Actor-Critic

kengz commented 5 years ago

Feature / Fix

Implement Soft Actor-Critic (SAC) from https://arxiv.org/abs/1801.01290
Split and update the Roboschool benchmark specs
Update random baselines for Roboschool
Change continuous-policy std sampling from a softplus transform to a log-clamp-exp transform; verified working with PPO on BipedalWalker
Add discrete control to SAC using a reparametrizable Categorical distribution (Gumbel-Softmax)
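
Two of the changes above can be sketched in plain Python. This is a minimal, framework-free sketch under stated assumptions, not the SLM-Lab code. It reads "log-clamp-exp" as: treat the network output as a log standard deviation, clamp it to a fixed range, then exponentiate. The bounds `LOG_STD_MIN`/`LOG_STD_MAX` and all function names are hypothetical. The Gumbel-Softmax part draws a relaxed one-hot sample `softmax((logits + g) / temperature)` with Gumbel(0, 1) noise `g`. In an autograd framework the same arithmetic keeps the sample differentiable with respect to the network outputs. In PyTorch, `torch.distributions.RelaxedOneHotCategorical(...).rsample()` provides such a relaxed categorical sample.

```python
import math
import random

# Hypothetical clamp range for the log std (an assumption, not SLM-Lab's values).
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0


def std_from_softplus(raw):
    """Previous transform: std = softplus(raw) = log(1 + exp(raw)), computed stably."""
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw)))


def std_from_log_clamp_exp(raw_log_std):
    """New transform: treat the raw output as log std, clamp it, then exponentiate."""
    clamped = min(max(raw_log_std, LOG_STD_MIN), LOG_STD_MAX)
    return math.exp(clamped)


def sample_gaussian_action(mean, raw_log_std, rng):
    """Reparameterized Gaussian sample: mean + std * eps with eps ~ N(0, 1)."""
    std = std_from_log_clamp_exp(raw_log_std)
    return mean + std * rng.gauss(0.0, 1.0)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample of Categorical(softmax(logits)) via Gumbel-Softmax."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1); keep u away from 0 and 1
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    return softmax(noisy)


rng = random.Random(0)
print(std_from_log_clamp_exp(5.0))  # clamped: equals exp(LOG_STD_MAX)
print(gumbel_softmax_sample([1.0, 2.0, 0.5], temperature=0.5, rng=rng))
```

Softplus only bounds the std below by zero. Clamping the log std bounds it on both sides, which is one plausible motivation for the switch: extreme standard deviations cannot arise. Lowering `temperature` pushes the Gumbel-Softmax output closer to a hard one-hot vector.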

Roboschool (continuous control) Benchmark

Note that the Roboschool reward scales are different from MuJoCo's.

Env. \ Alg.	A2C (GAE)	A2C (n-step)	PPO	SAC
RoboschoolAnt	-	-	-	1153.87
RoboschoolHalfCheetah	-	-	-	1204.68
RoboschoolHopper	-	-	-	1161.24
RoboschoolWalker2d	-	-	-	695.36

LunarLander (discrete control) Benchmark

(Figures: trial graph; moving average)

CarloLucibello commented 5 years ago

Hi, just wanted to point out the follow-up paper by the authors of SAC, https://arxiv.org/abs/1812.05905. One of the main differences from the original paper is that it doesn't use a separate V network.

kengz commented 5 years ago

@CarloLucibello Thanks for pointing that out. This PR implements the first version of SAC and adds a discrete control version. I'll implement the improved version in the next PR.
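
The difference discussed here shows up in how the bootstrap targets are formed. Below is a minimal scalar sketch. It assumes two Q networks with target copies and a fixed entropy temperature `alpha`. The function and argument names are hypothetical, and SLM-Lab's implementation may differ.

```python
def v1_value_target(q1, q2, log_prob, alpha):
    """First SAC paper: regression target for the separate V network at state s,
    using a ~ pi(.|s): min of the two Q estimates minus alpha * log pi(a|s)."""
    return min(q1, q2) - alpha * log_prob


def v1_q_target(reward, done, v_target_next, gamma):
    """First SAC paper: Q target bootstraps from the target V network at s'."""
    return reward + gamma * (1.0 - done) * v_target_next


def v2_q_target(reward, done, q1_targ_next, q2_targ_next, next_log_prob, alpha, gamma):
    """Follow-up paper: no V network. The soft value of s' is computed directly
    from the target Q networks at (s', a') with a' ~ pi(.|s')."""
    soft_value_next = min(q1_targ_next, q2_targ_next) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value_next


# Same inputs for both versions: if the target V network had learned the soft value exactly,
# the two Q targets agree.
v_next = v1_value_target(1.0, 2.0, log_prob=-0.5, alpha=0.2)
print(v1_q_target(1.0, 0.0, v_next, gamma=0.99))
print(v2_q_target(1.0, 0.0, 1.0, 2.0, next_log_prob=-0.5, alpha=0.2, gamma=0.99))
```

The two Q targets coincide only if the target V network matches `min Q - alpha * log pi` at s' exactly. The follow-up version skips that extra function approximation, and the extra network, by computing the soft value of s' from the target Q networks directly.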

kengz / SLM-Lab

Soft Actor-Critic #398