Closed kengz closed 5 years ago
Note that the Roboschool reward scales are different from MuJoCo's.
SAC improvements
- GumbelSoftmax distribution (custom)
- QMLPNet, QConvNet for Q(s,a) -> q in SAC
- This results in better performance over the original SAC benchmark in PR #398 (note, however, that the Polyak coefficient was off in that PR)
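For context on the GumbelSoftmax item, here is a minimal sketch of how a Gumbel-Softmax sample can be drawn for a discrete action space. It is not the custom distribution class from this PR: the function name `gumbel_softmax_sample`, its parameters, and the pure-Python (non-PyTorch) form are assumptions made for illustration only.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample (hypothetical helper, not the PR's API)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: a relaxed sample over 3 discrete actions, then the hard action index.
rng = random.Random(0)
sample = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
action = max(range(len(sample)), key=sample.__getitem__)
```

A lower temperature pushes the sample toward a one-hot vector, while a higher one flattens it toward uniform. In a PyTorch implementation the same arithmetic would be applied to tensors so that gradients can flow through the sample.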
Roboschool (continuous control) Benchmark
[Figures: four benchmark graphs]
LunarLander (discrete control) Benchmark