rlbayes / rllabplusplus


DDPG not converging/slow training? #6

Closed andreafranceschetti closed 6 years ago

andreafranceschetti commented 6 years ago

Hello, I was trying to train a Hopper Gym agent with the rllab++ version of DDPG found in sandbox/rocky/tf/algos/ddpg.py.

I initially ran the experiment as suggested by @shaneshixiang with the algo_gym_stub.py launcher script, modifying the flags in launcher_utils.py as needed. Yet the agent is not learning, even with different policy hidden sizes and reward scales, and the average action and other logged quantities quickly explode.

To double-check my experimental setup, I wrote the following script, which avoids the stub feature, but it has the same problems. Please note that every attribute not specified here keeps the default value given by the author.

from sandbox.rocky.tf.envs.base import TfEnv
from rllab.envs.normalized_env import normalize
from sandbox.rocky.tf.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from sandbox.rocky.tf.algos.ddpg import DDPG
from rllab.misc.instrument import stub,run_experiment_lite
from rllab.envs.gym_env import GymEnv
from sandbox.rocky.tf.exploration_strategies.ou_strategy import OUStrategy
from sandbox.rocky.tf.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction
from rllab import config
import rllab.misc.logger as logger

def set_experiment(*_):
    logger.set_snapshot_dir(config.LOG_DIR)
    env = TfEnv(normalize(GymEnv("Hopper-v1", force_reset=True, record_video=False, record_log=False)))
    policy = GaussianMLPPolicy(
        name="policy",
        env_spec=env.spec,
        hidden_sizes=(400, 300)
    )
    es = OUStrategy(env_spec=env.spec)
    qf = ContinuousMLPQFunction(env_spec=env.spec)

    algo = DDPG(
        env=env,
        es=es,
        qf=qf,
        policy=policy,
        batch_size=4000,
        n_itr=1000,
        discount=0.99,
        step_size=0.01,
        scale_reward=0.1
    )
    algo.train()

set_experiment()

The output is the following:

2017-11-10 11:17 | observation space: Box(11,)
2017-11-10 11:17 | action space: Box(3,)
2017-11-10 11:17 | Populating workers...
2017-11-10 11:17 | Populated
2017-11-10 11:17 | [init_opt] using target qf.
2017-11-10 11:17 | [init_opt] using target policy.
2017-11-10 11:17 | No checkpoint C:/data\params.chk
2017-11-10 11:17 | Critic batch size=32, Actor batch size=32
2017-11-10 11:17 | epoch #0 | Training started
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:17 | epoch #0 | Training finished
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #0 | Trained qf 0 steps, policy 0 steps
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #1 | Training started
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #1 | Training finished
2017-11-10 11:17 | epoch #1 | Trained qf 0 steps, policy 0 steps
2017-11-10 11:17 | epoch #2 | Training started
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:17 | epoch #2 | Training finished
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #2 | Trained qf 0 steps, policy 0 steps
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #3 | Training started
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:17 | epoch #3 | Training finished
2017-11-10 11:17 | epoch #3 | Trained qf 0 steps, policy 0 steps
2017-11-10 11:17 | epoch #4 | Training started
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:17 | epoch #4 | Training finished
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #4 | Trained qf 0 steps, policy 0 steps
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #5 | Training started
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:17 | epoch #5 | Training finished
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #5 | Trained qf 0 steps, policy 0 steps
2017-11-10 11:17 | epoch #6 | Training started
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #6 | Training finished
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #6 | Trained qf 0 steps, policy 0 steps
2017-11-10 11:17 | epoch #7 | Training started
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:17 | epoch #7 | Training finished
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #7 | Trained qf 0 steps, policy 0 steps
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #8 | Training started
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #8 | Training finished
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #8 | Trained qf 0 steps, policy 0 steps
2017-11-10 11:17 | epoch #9 | Training started
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #9 | Training finished
Total time elapsed: 00:00:00
2017-11-10 11:17 | epoch #9 | Trained qf 0 steps, policy 0 steps
2017-11-10 11:17 | epoch #9 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
C:\Users\Andrea\AppData\Local\Programs\Python\Python35\lib\site-packages\numpy\core\fromnumeric.py:2909: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
C:\Users\Andrea\AppData\Local\Programs\Python\Python35\lib\site-packages\numpy\core\_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
2017-11-10 11:17 | -----------------------  ----------
2017-11-10 11:17 | Epoch                      9
2017-11-10 11:17 | Iteration                  9
2017-11-10 11:17 | AverageReturn             10.389
2017-11-10 11:17 | StdReturn                  6.05704
2017-11-10 11:17 | MaxReturn                 49.6197
2017-11-10 11:17 | MinReturn                  1.93009
2017-11-10 11:17 | AverageEsReturn           10.0112
2017-11-10 11:17 | StdEsReturn                7.38025
2017-11-10 11:17 | MaxEsReturn               61.3863
2017-11-10 11:17 | MinEsReturn               -0.764754
2017-11-10 11:17 | AverageDiscountedReturn    9.73888
2017-11-10 11:17 | AverageAction              1.21038
2017-11-10 11:17 | QFunRegParamNorm           7.16876
2017-11-10 11:17 | AveragePolicySurr        nan
2017-11-10 11:17 | PolicyRegParamNorm        19.2498
2017-11-10 11:17 | AveragePolicyStd           1
2017-11-10 11:17 | -----------------------  ----------
2017-11-10 11:17 | epoch #10 | Training started
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #10 | Training finished
Total time elapsed: 00:00:06
2017-11-10 11:17 | epoch #10 | Trained qf 1000 steps, policy 1000 steps
2017-11-10 11:17 | epoch #10 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
2017-11-10 11:17 | -----------------------  -----------
2017-11-10 11:17 | Epoch                      10
2017-11-10 11:17 | Iteration                  10
2017-11-10 11:17 | AverageReturn              54.587
2017-11-10 11:17 | StdReturn                   1.8097
2017-11-10 11:17 | MaxReturn                  63.6004
2017-11-10 11:17 | MinReturn                  49.7244
2017-11-10 11:17 | AverageEsReturn            34.8509
2017-11-10 11:17 | StdEsReturn                11.0104
2017-11-10 11:17 | MaxEsReturn                50.1015
2017-11-10 11:17 | MinEsReturn                 3.97748
2017-11-10 11:17 | AverageDiscountedReturn    45.3728
2017-11-10 11:17 | AverageAction            4366.88
2017-11-10 11:17 | QFunRegParamNorm            7.90616
2017-11-10 11:17 | AverageQLoss                0.412717
2017-11-10 11:17 | AverageQ                   -2.11881
2017-11-10 11:17 | AverageAbsQ                 2.1912
2017-11-10 11:17 | AverageY                   -2.14001
2017-11-10 11:17 | AverageAbsY                 2.21835
2017-11-10 11:17 | AverageAbsQYDiff            0.320702
2017-11-10 11:17 | AveragePolicySurr          -5.22323
2017-11-10 11:17 | PolicyRegParamNorm         27.5306
2017-11-10 11:17 | AveragePolicyStd            1
2017-11-10 11:17 | -----------------------  -----------
2017-11-10 11:17 | epoch #11 | Training started
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:17 | epoch #11 | Training finished
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:17 | epoch #11 | Trained qf 1000 steps, policy 1000 steps
Total time elapsed: 00:00:06
2017-11-10 11:17 | epoch #11 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
2017-11-10 11:17 | -----------------------  --------------
2017-11-10 11:17 | Epoch                        11
2017-11-10 11:17 | Iteration                    11
2017-11-10 11:17 | AverageReturn                 3.08544
2017-11-10 11:17 | StdReturn                     0.0174516
2017-11-10 11:17 | MaxReturn                     3.12203
2017-11-10 11:17 | MinReturn                     3.04823
2017-11-10 11:17 | AverageEsReturn               5.54646
2017-11-10 11:17 | StdEsReturn                  10.3622
2017-11-10 11:17 | MaxEsReturn                  76.7926
2017-11-10 11:17 | MinEsReturn                   3.05211
2017-11-10 11:17 | AverageDiscountedReturn       3.03627
2017-11-10 11:17 | AverageAction            136967
2017-11-10 11:17 | QFunRegParamNorm             10.5495
2017-11-10 11:17 | AverageQLoss                  5.0424
2017-11-10 11:17 | AverageQ                      1.21598
2017-11-10 11:17 | AverageAbsQ                   7.04755
2017-11-10 11:17 | AverageY                      1.3116
2017-11-10 11:17 | AverageAbsY                   7.08942
2017-11-10 11:17 | AverageAbsQYDiff              1.10194
2017-11-10 11:17 | AveragePolicySurr          -118.029
2017-11-10 11:17 | PolicyRegParamNorm           51.2095
2017-11-10 11:17 | AveragePolicyStd              1
2017-11-10 11:17 | -----------------------  --------------
2017-11-10 11:17 | epoch #12 | Training started
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:18 | epoch #12 | Training finished
Total time elapsed: 00:00:06
2017-11-10 11:18 | epoch #12 | Trained qf 1000 steps, policy 1000 steps
2017-11-10 11:18 | epoch #12 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
2017-11-10 11:18 | -----------------------  --------------
2017-11-10 11:18 | Epoch                        12
2017-11-10 11:18 | Iteration                    12
2017-11-10 11:18 | AverageReturn                 5.2974
2017-11-10 11:18 | StdReturn                     0.0433093
2017-11-10 11:18 | MaxReturn                     5.41067
2017-11-10 11:18 | MinReturn                     5.07163
2017-11-10 11:18 | AverageEsReturn               4.36692
2017-11-10 11:18 | StdEsReturn                   1.09891
2017-11-10 11:18 | MaxEsReturn                   5.38162
2017-11-10 11:18 | MinEsReturn                   2.24463
2017-11-10 11:18 | AverageDiscountedReturn       5.13632
2017-11-10 11:18 | AverageAction            803172
2017-11-10 11:18 | QFunRegParamNorm             24.8687
2017-11-10 11:18 | AverageQLoss               8132.4
2017-11-10 11:18 | AverageQ                    316.743
2017-11-10 11:18 | AverageAbsQ                 316.743
2017-11-10 11:18 | AverageY                    322.315
2017-11-10 11:18 | AverageAbsY                 322.317
2017-11-10 11:18 | AverageAbsQYDiff             34.7282
2017-11-10 11:18 | AveragePolicySurr         -3052.48
2017-11-10 11:18 | PolicyRegParamNorm           96.1691
2017-11-10 11:18 | AveragePolicyStd              1
2017-11-10 11:18 | -----------------------  --------------
2017-11-10 11:18 | epoch #13 | Training started
0% [############################# ] 100% | ETA: 00:00:002017-11-10 11:18 | epoch #13 | Training finished
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 11:18 | epoch #13 | Trained qf 1000 steps, policy 1000 steps
Total time elapsed: 00:00:06
2017-11-10 11:18 | epoch #13 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:04
2017-11-10 11:18 | -----------------------  ----------------
2017-11-10 11:18 | Epoch                        13
2017-11-10 11:18 | Iteration                    13
2017-11-10 11:18 | AverageReturn                 5.29554
2017-11-10 11:18 | StdReturn                     0.0436027
2017-11-10 11:18 | MaxReturn                     5.41255
2017-11-10 11:18 | MinReturn                     5.07709
2017-11-10 11:18 | AverageEsReturn               5.28534
2017-11-10 11:18 | StdEsReturn                   0.104083
2017-11-10 11:18 | MaxEsReturn                   5.3769
2017-11-10 11:18 | MinEsReturn                   4.34235
2017-11-10 11:18 | AverageDiscountedReturn       5.13451
2017-11-10 11:18 | AverageAction                 2.03434e+06
2017-11-10 11:18 | QFunRegParamNorm             48.083
2017-11-10 11:18 | AverageQLoss                  1.06782e+06
2017-11-10 11:18 | AverageQ                   4103.61
2017-11-10 11:18 | AverageAbsQ                4103.61
2017-11-10 11:18 | AverageY                   4210.98
2017-11-10 11:18 | AverageAbsY                4210.98
2017-11-10 11:18 | AverageAbsQYDiff            491.12
2017-11-10 11:18 | AveragePolicySurr        -23603.5
2017-11-10 11:18 | PolicyRegParamNorm          146.593
2017-11-10 11:18 | AveragePolicyStd              1
2017-11-10 11:18 | -----------------------  ----------------
2017-11-10 11:18 | epoch #14 | Training started
0% [#################             ] 100% | ETA: 00:00:02

Can someone explain why this is happening? Every algorithm in launcher_stub_utils based on ddpg.py has this problem, while trpo.py and vpg.py converge properly and faster. Thanks in advance to everyone who can help!

rlbayes commented 6 years ago

It seems you are not using DeterministicMLPPolicy with a tanh activation on the policy output for DDPG? This is crucial.

Because the code is written to be used across many domains and many algorithm variants, please do not trust the default parameters for each method.

A potential problem is that the last time I updated rllab++, I finished by syncing with the updated rllab code and did not re-check performance afterwards. Let me know if the problem persists.
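
For reference, here is a minimal sketch of that change applied to the script above. It assumes a DeterministicMLPPolicy class under sandbox/rocky/tf/policies that accepts hidden_nonlinearity and output_nonlinearity arguments (as the [get_policy] log lines further down suggest); treat the exact import path and constructor signature as assumptions rather than the repository's confirmed API.

import tensorflow as tf

from sandbox.rocky.tf.envs.base import TfEnv
from rllab.envs.normalized_env import normalize
from rllab.envs.gym_env import GymEnv
from sandbox.rocky.tf.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from sandbox.rocky.tf.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction
from sandbox.rocky.tf.exploration_strategies.ou_strategy import OUStrategy
from sandbox.rocky.tf.algos.ddpg import DDPG

env = TfEnv(normalize(GymEnv("Hopper-v1", force_reset=True,
                             record_video=False, record_log=False)))

# Deterministic actor with a bounded (tanh) output, instead of a GaussianMLPPolicy.
policy = DeterministicMLPPolicy(
    name="policy",
    env_spec=env.spec,
    hidden_sizes=(400, 300),
    hidden_nonlinearity=tf.nn.tanh,
    output_nonlinearity=tf.nn.tanh,   # the missing piece: bound actions to [-1, 1]
)
qf = ContinuousMLPQFunction(env_spec=env.spec)
es = OUStrategy(env_spec=env.spec)

algo = DDPG(env=env, es=es, qf=qf, policy=policy,
            batch_size=4000, n_itr=1000, discount=0.99,
            step_size=0.01, scale_reward=0.1)
algo.train()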


andreafranceschetti commented 6 years ago

Whoops, I can't believe I missed that... Now it's working properly. In launcher_utils the policy output nonlinearity was set to None by default. My bad. I hope this will be helpful to anybody else playing with DDPG.

The average returns on Hopper are quite unstable with DDPG, but this is a known fact (https://arxiv.org/pdf/1708.04133.pdf).

As for performance, the training part of each epoch takes roughly 30 seconds, almost 10x TRPO's.


2017-11-10 14:29 | Warning: skipping Gym environment monitoring since snapshot_dir not configured.
2017-11-10 14:29 | observation space: Box(11,)
2017-11-10 14:29 | action space: Box(3,)
Overwrite C:/data/local/default/Hopper-v1-5000--an-ddpg--lr-0-001--pbs-32--phn-tanh--phs-400x300--pon-tanh--psl-True--pur-1-0--put-True--qbs-32--qhn-relu--qhs-32x32--qlr-0-001--qmr-0--qrp-0--qut-True--sr-0-01--ur-1-0--s-1?: (yes/no)yes
[get_policy] Instantiating DeterministicMLPPolicy, with sizes=[400, 300], hidden_nonlinearity=tanh.
[get_policy] output_nonlinearity=tanh.
[get_baseline] Instantiating None.
[get_qf] Instantiating ContinuousMLPQFunction, with sizes=[32, 32], hidden_nonlinearity=relu.
[get_es] Instantiating OUStrategy.
Creating algo=ddpg with n_itr=2000, max_path_length=1000...
[get_algo] Instantiating DDPG.
using seed 1
2017-11-10 14:30 | Setting seed to 1
using seed 1
2017-11-10 14:30 | observation space: Box(11,)
2017-11-10 14:30 | action space: Box(3,)
2017-11-10 14:30 | Populating workers...
2017-11-10 14:30 | Populated
2017-11-10 14:30 | [init_opt] using target qf.
2017-11-10 14:30 | [init_opt] using target policy.
2017-11-10 14:30 | No checkpoint C:/data/local/default/Hopper-v1-5000--an-ddpg--lr-0-001--pbs-32--phn-tanh--phs-400x300--pon-tanh--psl-True--pur-1-0--put-True--qbs-32--qhn-relu--qhs-32x32--qlr-0-001--qmr-0--qrp-0--qut-True--sr-0-01--ur-1-0--s-1\params.chk
2017-11-10 14:30 | Critic batch size=32, Actor batch size=32
2017-11-10 14:30 | epoch #0 | Training started
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 14:30 | epoch #0 | Training finished
Total time elapsed: 00:00:25
2017-11-10 14:30 | epoch #0 | Trained qf 4000 steps, policy 4000 steps
2017-11-10 14:30 | epoch #0 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
2017-11-10 14:30 | -----------------------  ------------
2017-11-10 14:30 | Epoch                      0
2017-11-10 14:30 | Iteration                  0
2017-11-10 14:30 | AverageReturn             48.8828
2017-11-10 14:30 | StdReturn                  0.893547
2017-11-10 14:30 | MaxReturn                 51.6671
2017-11-10 14:30 | MinReturn                 47.1554
2017-11-10 14:30 | AverageEsReturn           21.8725
2017-11-10 14:30 | StdEsReturn               32.4687
2017-11-10 14:30 | MaxEsReturn              175.577
2017-11-10 14:30 | MinEsReturn               -1.97369
2017-11-10 14:30 | AverageDiscountedReturn   41.3946
2017-11-10 14:30 | AverageAction              0.54036
2017-11-10 14:30 | QFunRegParamNorm           7.27332
2017-11-10 14:30 | AverageQLoss               0.0200934
2017-11-10 14:30 | AverageQ                  -0.133591
2017-11-10 14:30 | AverageAbsQ                0.191488
2017-11-10 14:30 | AverageY                  -0.13435
2017-11-10 14:30 | AverageAbsY                0.194736
2017-11-10 14:30 | AverageAbsQYDiff           0.0622197
2017-11-10 14:30 | AveragePolicySurr          0.00153158
2017-11-10 14:30 | PolicyRegParamNorm        27.8091
2017-11-10 14:30 | -----------------------  ------------
2017-11-10 14:30 | epoch #1 | Training started
0% [############################# ] 100% | ETA: 00:00:012017-11-10 14:31 | epoch #1 | Training finished
2017-11-10 14:31 | epoch #1 | Trained qf 5000 steps, policy 5000 steps
2017-11-10 14:31 | epoch #1 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:31
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03
2017-11-10 14:31 | -----------------------  ------------
2017-11-10 14:31 | Epoch                      1
2017-11-10 14:31 | Iteration                  1
2017-11-10 14:31 | AverageReturn             79.6939
2017-11-10 14:31 | StdReturn                  1.04619
2017-11-10 14:31 | MaxReturn                 81.7443
2017-11-10 14:31 | MinReturn                 77.0883
2017-11-10 14:31 | AverageEsReturn           63.9032
2017-11-10 14:31 | StdEsReturn               26.6332
2017-11-10 14:31 | MaxEsReturn              148.647
2017-11-10 14:31 | MinEsReturn               15.5738
2017-11-10 14:31 | AverageDiscountedReturn   63.5839
2017-11-10 14:31 | AverageAction              0.571644
2017-11-10 14:31 | QFunRegParamNorm           7.58411
2017-11-10 14:31 | AverageQLoss               0.00148588
2017-11-10 14:31 | AverageQ                   0.235749
2017-11-10 14:31 | AverageAbsQ                0.242382
2017-11-10 14:31 | AverageY                   0.23594
2017-11-10 14:31 | AverageAbsY                0.242601
2017-11-10 14:31 | AverageAbsQYDiff           0.025429
2017-11-10 14:31 | AveragePolicySurr         -0.300727
2017-11-10 14:31 | PolicyRegParamNorm        42.8005
2017-11-10 14:31 | -----------------------  ------------
2017-11-10 14:31 | epoch #2 | Training started
0% [##############################] 100% | ETA: 00:00:00
2017-11-10 14:31 | epoch #2 | Training finished
Total time elapsed: 00:00:31
2017-11-10 14:31 | epoch #2 | Trained qf 5000 steps, policy 5000 steps
2017-11-10 14:31 | epoch #2 | Collecting samples for evaluation
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03

Is this normal?
I fear something is wrong with my parameters again :(

# environment params
flags.DEFINE_string('env_name', 'Hopper-v1', 'Environment.')
flags.DEFINE_float('discount', 0.99, 'Discount.')

# learning params
flags.DEFINE_float('learning_rate', 0.001, 'Base learning rate.')
flags.DEFINE_integer('batch_size', 5000, 'Batch size.')
flags.DEFINE_string('algo_name', 'ddpg', 'RLAlgorithm.')
flags.DEFINE_integer('seed', 1, 'Seed.')
flags.DEFINE_integer('max_episode', 10000, 'Max episodes.')
flags.DEFINE_boolean('normalize_obs', False, 'Normalize observations.')
flags.DEFINE_boolean('recurrent', False, 'Recurrent policy.')
flags.DEFINE_string('policy_hidden_sizes', '400x300', 'Sizes of policy hidden layers.')
flags.DEFINE_string('qf_hidden_sizes', '32x32', 'Sizes of qf hidden layers.')
flags.DEFINE_string('policy_hidden_nonlinearity', 'tanh', 'hidden nonlinearity for policy.')
flags.DEFINE_string('policy_output_nonlinearity', 'tanh', 'output nonlinearity for policy.')
flags.DEFINE_string('qf_hidden_nonlinearity', 'relu', 'Hidden nonlinearity for qf.')
flags.DEFINE_boolean('policy_use_target', True, 'Use target policy.')
flags.DEFINE_boolean('qf_use_target', True, 'Use target qf')

# batchopt params
flags.DEFINE_float('gae_lambda', 0.97, 'Generalized advantage estimation lambda.')
flags.DEFINE_string('baseline_cls', 'linear', 'Baseline class.')
flags.DEFINE_string('baseline_hidden_sizes', '50x50', 'Baseline network hidden sizes.')

# trpo params
flags.DEFINE_float('step_size', 0.01, 'Step size for TRPO.')
flags.DEFINE_integer('sample_backups', 0, 'Backup off-policy samples for Q-prop est.')
flags.DEFINE_integer('kl_sample_backups', 0, 'Backup off-policy samples for KL est.')

# ddpg params
flags.DEFINE_float('scale_reward', 0.01, 'Scale reward for Q-learning.')
flags.DEFINE_float('policy_updates_ratio', 1.0, 'Policy updates per critic update for DDPG.')
flags.DEFINE_integer('replay_pool_size', 5000, 'Batch size during Q-prop.')
flags.DEFINE_float('replacement_prob', 1.0, 'Replacement probability.')
flags.DEFINE_float('qf_learning_rate', 1e-3, 'Learning rate for Qfunction.')
flags.DEFINE_float('updates_ratio', 1.0, 'Updates per actor experience.')
flags.DEFINE_integer('policy_batch_size', 32, 'Batch size for policy update.')
flags.DEFINE_boolean('policy_sample_last', True, 'Sample most recent batch for policy update.')
flags.DEFINE_integer('qf_batch_size', 32, 'Qf batch size.')
flags.DEFINE_float('qf_mc_ratio', 0, 'Ratio of MC regression objective for fitting Q function.')
flags.DEFINE_float('qf_residual_phi', 0, 'Phi interpolating direct method and residual gradient method.')
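
As a purely illustrative aside (this is not the actual launcher_utils code), the string-valued flags above are presumably parsed into constructor arguments roughly along these lines, which would match the "[get_policy] Instantiating DeterministicMLPPolicy, with sizes=[400, 300], hidden_nonlinearity=tanh" lines in the log; parse_hidden_sizes and parse_nonlinearity are hypothetical helpers.

import tensorflow as tf

# Hypothetical mapping from flag strings to TensorFlow nonlinearities.
NONLINEARITIES = {
    'tanh': tf.nn.tanh,
    'relu': tf.nn.relu,
    'none': None,  # the old default for policy_output_nonlinearity: unbounded actions
}

def parse_hidden_sizes(spec):
    # '400x300' -> [400, 300]
    return [int(s) for s in spec.split('x')]

def parse_nonlinearity(name):
    return NONLINEARITIES[(name or 'none').lower()]

policy_hidden_sizes = parse_hidden_sizes('400x300')       # [400, 300]
policy_hidden_nonlinearity = parse_nonlinearity('tanh')   # tf.nn.tanh
policy_output_nonlinearity = parse_nonlinearity('tanh')   # tf.nn.tanh (was None before the fix)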

Thank you so much @shaneshixiang , your help is gold!!!

rlbayes commented 6 years ago

Hi,

I only see runs up to epoch 2, so I cannot judge whether the results are good. DDPG is indeed quite unstable on the Hopper domain. A DDPG + parameter-noise variant may possibly work better (not in my code, but here: https://github.com/openai/baselines). The OU exploration heuristic often seems to hurt a lot in some domains.

I have considered updating the code to set the right defaults for DDPG. The tricky part is that my code relies heavily on reusing flag arguments, so the policy network for Q-Prop and DDPG is specified by the same flags, leading to incorrect defaults in the DDPG case.


andreafranceschetti commented 6 years ago

Yeah, I noticed that in launcher_stub_utils. I think the results are good. However, I can post a full debug.log with many epochs if it would help diagnose the algorithm setup.

I was only wondering about the reason for such a difference in training time per epoch between DDPG and the other algorithms.

This is quite limiting, since finding the right algorithm and domain hyperparameters may require a lot of trials, and thus time, even on my i7 4770 (4 cores / 8 threads). I'm considering running my experiments on an EC2 instance.

andreafranceschetti commented 6 years ago

@shaneshixiang is batch_size=4000 correct for the DDPG algorithm? I think this is the main cause of my slow training. Clearly I don't understand the difference between replay_pool_size and batch_size in your code... policy_batch_size and qf_batch_size should be the sizes of the mini-batches randomly sampled from the replay pool, right?

rlbayes commented 6 years ago

DDPG is very slow because by default it does 1 update per step, while on-policy algorithms do 1 update per batch (say 5000 steps).

Replay pool size is the size of the replay buffer. Qf batch size is the minibatch size sampled from the replay buffer to train the Q-function in DDPG or Q-Prop. Policy batch size is the minibatch size sampled from the replay buffer to train the policy in DDPG (it is a separate setting because, for the trust-region variant of DDPG, I use a larger batch with a small number of updates).
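
To make the bookkeeping concrete, here is a rough sketch of that update schedule, not the actual rllab++ training loop; exploration_policy, train_critic, train_actor, and min_pool_size are placeholders introduced for illustration.

import random

def ddpg_epoch(env, replay_pool, exploration_policy, train_critic, train_actor,
               steps_per_epoch=5000, qf_batch_size=32, policy_batch_size=32,
               min_pool_size=1000, updates_ratio=1.0):
    # Collect experience one step at a time and, once the pool is warm,
    # do ~updates_ratio critic/actor updates per environment step.
    obs = env.reset()
    for _ in range(steps_per_epoch):
        action = exploration_policy(obs)                  # actor output + OU noise
        next_obs, reward, done, _ = env.step(action)
        replay_pool.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

        if len(replay_pool) < min_pool_size:
            continue  # presumably why the early epochs log "Trained qf 0 steps"
        for _ in range(int(updates_ratio)):
            train_critic(random.sample(replay_pool, qf_batch_size))      # qf minibatch
            train_actor(random.sample(replay_pool, policy_batch_size))   # policy minibatch
    # ~5000 gradient updates per 5000-step epoch, versus a single policy update
    # per 5000-step batch for TRPO/VPG, hence the large wall-clock gap per epoch.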


andreafranceschetti commented 6 years ago

Thanks for the clarifications!!