vlad17 / mve

MVE: model-based value estimation
Apache License 2.0

improving bare DDPG #94

Open · vlad17 opened 6 years ago

vlad17 commented 6 years ago

Consider the following improvements to DDPG, with values in parentheses taken from Aurick (who did the SAC runs).

vlad17 commented 6 years ago

Here are Aurick's params (from SAC), based on https://github.com/rll/rllab/blob/master/rllab/algos/ddpg.py

{ "batch_size": 128, "discount": 0.99, "epoch_length": 1000, "eval_samples": 1000, "max_path_length": 1000, "min_pool_size": 1000, "n_epochs": 10000, "n_updates_per_time_step": 4, "qf_weight_decay": 0.0, "replay_pool_size": 1000000, "soft_target_tau": 0.01 }

Scale reward and Q-function learning rate varied across the envs:

Half-Cheetah: "qf_learning_rate": 0.001, "scale_reward": 0.05

Ant: "qf_learning_rate": 0.0003, "scale_reward": 0.3

Walker: "qf_learning_rate": 0.001, "scale_reward": 0.05

More from Aurick:

We used an implementation of DDPG by Vitchyr Pong (another grad student here). The policy LR is indeed 1e-4. We also used the OUStrategy with mu=0, theta=0.15, sigma=0.3 as the exploration strategy for DDPG. Our policy network had 2 hidden layers of 100 units each.
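For reference, a minimal sketch of Ornstein-Uhlenbeck exploration noise with the parameters Aurick mentions (mu=0, theta=0.15, sigma=0.3). This only illustrates the process; it is not the actual rllab OUStrategy code, and the class/method names are assumptions.

```python
import numpy as np


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process with a unit time step."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.3):
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        # Called at the start of every episode.
        self.state[:] = self.mu

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, I)
        self.state = self.state + self.theta * (self.mu - self.state) \
            + self.sigma * np.random.randn(*self.state.shape)
        return self.state
```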

vlad17 commented 6 years ago

Another improvement, which will require some work, would be to have shorter periods between updates (so parameters update faster). The best way I see to do this is to make a Sampler class which keeps a running live environment (auto-reset after each episode finishes). Then sampler.sample(data) adds the data right into the ring buffer. Be sure terminals are recorded appropriately. A rough sketch follows.
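Here is one rough sketch of such a Sampler, assuming a gym-style env (`step` returning `(obs, reward, done, info)`), a hypothetical `policy.act`, and a hypothetical `ringbuffer.add`; the exact `sample` signature would differ from the `sampler.sample(data)` call mentioned above.

```python
class Sampler:
    """Keeps one live environment that is auto-reset when episodes finish,
    and writes transitions directly into the replay ring buffer."""

    def __init__(self, env, max_path_length):
        self.env = env
        self.max_path_length = max_path_length
        self.obs = env.reset()
        self.path_length = 0

    def sample(self, policy, ringbuffer, n_steps):
        """Take n_steps in the live env; return how many episodes finished."""
        episodes_finished = 0
        for _ in range(n_steps):
            action = policy.act(self.obs)
            next_obs, reward, done, _ = self.env.step(action)
            self.path_length += 1
            # Record a terminal only for true episode ends, not for
            # artificial cutoffs at max_path_length.
            terminal = done and self.path_length < self.max_path_length
            ringbuffer.add(self.obs, action, reward, next_obs, terminal)
            if done or self.path_length >= self.max_path_length:
                self.obs = self.env.reset()
                self.path_length = 0
                episodes_finished += 1
            else:
                self.obs = next_obs
        return episodes_finished
```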

Under this setup we'd have to change reporter.advance so that it accepts the number of episodes processed in the previous iteration (which may be zero).
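One possible shape for that change (the signature and internals here are guesses, not the repo's actual reporter API):

```python
class Reporter:
    def __init__(self):
        self.iteration = 0
        self.total_episodes = 0

    def advance(self, episodes_completed):
        """Advance one iteration; episodes_completed may be 0 when the
        sampler finished no episodes since the last call."""
        self.iteration += 1
        self.total_episodes += episodes_completed
```

With the Sampler sketch above this would be called as `reporter.advance(sampler.sample(policy, ringbuffer, n_steps))`.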