vlad17 / mve

MVE: model-based value estimation
Apache License 2.0

improving bare DDPG #94

Open · vlad17 opened 6 years ago

vlad17 commented 6 years ago

Consider the following improvements to DDPG, with values in parentheses taken from Aurick (who did the SAC runs).

vlad17 commented 6 years ago

Here are Aurick's params (from SAC), based on https://github.com/rll/rllab/blob/master/rllab/algos/ddpg.py

{ "batch_size": 128, "discount": 0.99, "epoch_length": 1000, "eval_samples": 1000, "max_path_length": 1000, "min_pool_size": 1000, "n_epochs": 10000, "n_updates_per_time_step": 4, "qf_weight_decay": 0.0, "replay_pool_size": 1000000, "soft_target_tau": 0.01 }

Scale reward and Q-function learning rate varied across the envs:

Half-Cheetah: "qf_learning_rate": 0.001, "scale_reward": 0.05

Ant: "qf_learning_rate": 0.0003, "scale_reward": 0.3

Walker: "qf_learning_rate": 0.001, "scale_reward": 0.05

More from Aurick:

We used an implementation of DDPG by Vitchyr Pong (another grad student here). The policy LR is indeed 1e-4. We also used the OUStrategy with mu=0, theta=0.15, sigma=0.3 as the exploration strategy for DDPG. Our policy network had 2 hidden layers of 100 units each.
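For reference, a minimal sketch of Ornstein-Uhlenbeck exploration noise with the parameters Aurick mentions (mu=0, theta=0.15, sigma=0.3). This only illustrates the process; it is not the actual rllab OUStrategy code, and the class/method names are assumptions.

```python
import numpy as np


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process with a unit time step."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.3):
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        # Called at the start of every episode.
        self.state[:] = self.mu

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, I)
        self.state = self.state + self.theta * (self.mu - self.state) \
            + self.sigma * np.random.randn(*self.state.shape)
        return self.state
```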

vlad17 commented 6 years ago

Another improvement, which will require some work, would be to have shorter periods between updates (so parameters update faster). The best way I see to do this is to make a Sampler class which keeps a running live environment (auto-reset after each episode finishes). Then sampler.sample(data) adds the data right into the ring buffer. Be sure terminals are recorded appropriately. A rough sketch follows.
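Here is one rough sketch of such a Sampler, assuming a gym-style env (`step` returning `(obs, reward, done, info)`), a hypothetical `policy.act`, and a hypothetical `ringbuffer.add`; the exact `sample` signature would differ from the `sampler.sample(data)` call mentioned above.

```python
class Sampler:
    """Keeps one live environment that is auto-reset when episodes finish,
    and writes transitions directly into the replay ring buffer."""

    def __init__(self, env, max_path_length):
        self.env = env
        self.max_path_length = max_path_length
        self.obs = env.reset()
        self.path_length = 0

    def sample(self, policy, ringbuffer, n_steps):
        """Take n_steps in the live env; return how many episodes finished."""
        episodes_finished = 0
        for _ in range(n_steps):
            action = policy.act(self.obs)
            next_obs, reward, done, _ = self.env.step(action)
            self.path_length += 1
            # Record a terminal only for true episode ends, not for
            # artificial cutoffs at max_path_length.
            terminal = done and self.path_length < self.max_path_length
            ringbuffer.add(self.obs, action, reward, next_obs, terminal)
            if done or self.path_length >= self.max_path_length:
                self.obs = self.env.reset()
                self.path_length = 0
                episodes_finished += 1
            else:
                self.obs = next_obs
        return episodes_finished
```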

Under this setup we'd have to change reporter.advance so that it accepts the number of episodes processed in the previous iteration (which may be zero).
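One possible shape for that change (the signature and internals here are guesses, not the repo's actual reporter API):

```python
class Reporter:
    def __init__(self):
        self.iteration = 0
        self.total_episodes = 0

    def advance(self, episodes_completed):
        """Advance one iteration; episodes_completed may be 0 when the
        sampler finished no episodes since the last call."""
        self.iteration += 1
        self.total_episodes += episodes_completed
```

With the Sampler sketch above this would be called as `reporter.advance(sampler.sample(policy, ringbuffer, n_steps))`.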