
Add Lookahead+RAdam optimizer #416

Closed: kengz closed this pull request 4 years ago

kengz commented 4 years ago

Experiment Results

Abstract

The Lookahead + RAdam optimizer significantly improves the performance of some RL algorithms (A2C (n-step), PPO) on continuous-domain problems, but does not improve others (A2C (GAE), SAC).

Methods

  1. Implement the RAdam and Lookahead optimizers (see the sketches below)
  2. Update the optim_spec to replace Adam with the Lookahead(RAdam) optimizer
  3. Run the benchmark
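
As a rough sketch of step 1, the Lookahead wrapper maintains a slow copy of the weights and, every k fast steps, pulls it toward the fast weights before syncing the fast weights back. The code below is a minimal illustration of that rule, not the exact implementation merged in this PR; the defaults alpha=0.5 and k=5 are the common choices from the Lookahead paper, not necessarily SLM-Lab's settings.

```python
import torch

class Lookahead:
    """Sketch of the Lookahead wrapper (Zhang et al. 2019): the inner
    ("fast") optimizer runs normally, and every k steps the slow weights
    are pulled toward the fast weights, which are then reset to them."""

    def __init__(self, base_optimizer, alpha=0.5, k=5):
        self.base = base_optimizer
        self.alpha = alpha  # interpolation factor for the slow weights
        self.k = k          # sync period, in fast steps
        self.counter = 0
        # one slow copy per parameter, initialized to the current weights
        self.slow_weights = [
            [p.detach().clone() for p in group['params']]
            for group in base_optimizer.param_groups
        ]

    def zero_grad(self):
        self.base.zero_grad()

    def step(self):
        loss = self.base.step()  # fast update by the inner optimizer
        self.counter += 1
        if self.counter % self.k == 0:
            for group, slows in zip(self.base.param_groups, self.slow_weights):
                for p, slow in zip(group['params'], slows):
                    # slow <- slow + alpha * (fast - slow); fast <- slow
                    slow.add_(p.data - slow, alpha=self.alpha)
                    p.data.copy_(slow)
        return loss
```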

Implementations inspired by/adapted from LiyuanLucasLiu/RAdam, lonePatient/lookahead_pytorch, and Less Wright's Medium article.
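
For reference, the core RAdam update can be sketched as follows: it applies the paper's variance-rectification term once the approximated simple-moving-average length rho_t exceeds 4, and falls back to a momentum-style update before that. This is a simplified sketch in the spirit of the referenced implementations, not the code in this PR; the hyperparameter defaults are illustrative.

```python
import math
import torch

class RAdam(torch.optim.Optimizer):
    """Sketch of Rectified Adam (Liu et al. 2019): rectifies the adaptive
    learning rate while its variance estimate is unreliable early on."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group['betas']
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:  # lazy state initialization
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p)
                    state['exp_avg_sq'] = torch.zeros_like(p)
                state['step'] += 1
                t = state['step']
                m, v = state['exp_avg'], state['exp_avg_sq']
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
                # length of the approximated simple moving average (SMA)
                rho_inf = 2 / (1 - beta2) - 1
                rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
                if rho_t > 4:
                    # variance is tractable: rectified adaptive update
                    v_hat = (v / (1 - beta2 ** t)).sqrt_().add_(group['eps'])
                    r_t = math.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf
                                    / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
                    p.add_(m_hat / v_hat, alpha=-group['lr'] * r_t)
                else:
                    # early steps: un-adapted momentum update
                    p.add_(m_hat, alpha=-group['lr'])
        return loss
```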

To Reproduce

Use commit e5988f04b2ca5935d253c0b16f01187e097e7a87 to run the spec files.

Results

All the contributed results will be added to the benchmark and made publicly available on Dropbox.

We run benchmarks to directly compare the performance of Adam and Lookahead(RAdam) using the same code and spec files, changing only the optimizers (see the git diff of this PR). Due to limited computational resources, we focus the study on continuous-control environments from Roboschool.
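
In code terms, the controlled change amounts to swapping the optimizer constructor. Using the sketch classes above (hyperparameter values illustrative only):

```python
import torch

net = torch.nn.Linear(8, 2)  # stand-in for an actor/critic network

# baseline runs: plain Adam
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)

# this PR's runs: RAdam wrapped in Lookahead (sketch classes above)
optimizer = Lookahead(RAdam(net.parameters(), lr=3e-4), alpha=0.5, k=5)
```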

We find that Lookahead(RAdam) significantly improves A2C (n-step) and PPO, but does not improve A2C (GAE) and SAC. The results are tabulated below. In sum, the new results are run with Lookahead(RAdam) for the algorithms it improves (A2C (n-step), PPO) and Adam otherwise.

New Roboschool benchmark

Legend: (legend image; the per-environment training graphs are omitted from the tables below)
| Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | SAC |
|---|---:|---:|---:|---:|
| RoboschoolAnt | 787 | 1396 | 1843 | 2915 |
| RoboschoolAtlasForwardWalk | 59.87 | 88.04 | 172 | 800 |
| RoboschoolHalfCheetah | 712 | 439 | 1960 | 2497 |
| RoboschoolHopper | 710 | 285 | 2042 | 2045 |
| RoboschoolInvertedDoublePendulum | 996 | 4410 | 8076 | 8085 |
| RoboschoolInvertedPendulum | 995 | 978 | 986 | 941 |
| RoboschoolReacher | 12.9 | 10.16 | 19.51 | 19.99 |
| RoboschoolWalker2d | 280 | 220 | 1660 | 1894 |

Old Roboschool benchmark

| Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | SAC |
|---|---:|---:|---:|---:|
| RoboschoolAnt | 1029.51 | 1148.76 | 1931.35 | 2914.75 |
| RoboschoolAtlasForwardWalk | 68.15 | 73.46 | 148.81 | 942.39 |
| RoboschoolHalfCheetah | 895.24 | 409.59 | 1838.69 | 2496.54 |
| RoboschoolHopper | 286.67 | -187.91 | 2079.22 | 2251.36 |
| RoboschoolInvertedDoublePendulum | 1769.74 | 486.76 | 7967.03 | 8085.04 |
| RoboschoolInvertedPendulum | 1000.0 | 997.54 | 930.29 | 941.45 |
| RoboschoolReacher | 14.57 | -6.18 | 19.18 | 19.99 |
| RoboschoolWalker2d | 413.26 | 141.83 | 1368.25 | 1894.05 |

New Humanoid benchmark

Humanoid environments are significantly harder. Note that due to the number of frames required, we could only run the asynchronous version of SAC (Async-SAC).

| Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | Async-SAC |
|---|---:|---:|---:|---:|
| RoboschoolHumanoid | 99.31 | 54.58 | 2388 | 2621 |
| RoboschoolHumanoidFlagrun | 73.57 | 178 | 2014 | 2056 |
| RoboschoolHumanoidFlagrunHarder | -429 | 253 | 680 | 280 |

Old Humanoid benchmark

| Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | Async-SAC |
|---|---:|---:|---:|---:|
| RoboschoolHumanoid | 122.23 | -6029.02 | 1554.03 | 2621.46 |
| RoboschoolHumanoidFlagrun | 93.48 | -2079.02 | 1635.64 | 1937.77 |
| RoboschoolHumanoidFlagrunHarder | -472.34 | -24620.71 | 610.09 | 280.18 |