rll / rllab

rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.

DDPG parameters #146

Open anaypat opened 7 years ago

anaypat commented 7 years ago

Can you please let us know the parameters used for the DDPG algorithm to reproduce the results given in the "Benchmarking Deep Reinforcement Learning for Continuous Control" paper (https://arxiv.org/pdf/1604.06778.pdf)? I couldn't find them in the supplementary material.

cannontwo commented 7 years ago

Check here: https://arxiv.org/abs/1509.02971

anaypat commented 7 years ago

Thanks! However, I couldn't find some parameters such as epoch_length, max_path_length, n_epochs, etc.

It has been mentioned in https://arxiv.org/abs/1509.02971 (Section 7) that "Actions were not included until the 2nd hidden layer of Q". In rllab's DeterministicMLPPolicy class, I couldn't find where the actions are injected into the Q network.

It seems that the critic network architecture is different. Am I missing something?

cannontwo commented 7 years ago

I'm not sure what you mean by referring to the DeterministicMLPPolicy, as the code implementing DDPG is here: https://github.com/openai/rllab/blob/master/rllab/algos/ddpg.py. Hopefully that helps.

anaypat commented 7 years ago

Sorry, I previously referred to the actor network instead of the critic network. I should have mentioned https://github.com/openai/rllab/blob/master/rllab/q_functions/continuous_mlp_q_function.py. rllab does indeed merge the action into the second hidden layer of the critic.

btw, I still couldn't get hold of epoch_length, max_path_length, n_epochs, etc.
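
For illustration only (a toy NumPy sketch, not rllab's actual code), here is a critic forward pass where the action is concatenated into the second hidden layer; the (400, 300) sizes match those discussed in this thread, and the input dimensions below are made up:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def critic_q_value(obs, action, params):
    """Q(s, a) where the action enters only at the second hidden layer."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(obs @ W1 + b1)                           # layer 1 sees the observation only
    h2 = relu(np.concatenate([h1, action]) @ W2 + b2)  # action is concatenated here
    return (h2 @ W3 + b3).item()                       # scalar Q-value

# Toy dimensions (hypothetical); hidden sizes (400, 300) as in the DDPG paper.
obs_dim, act_dim, h1_dim, h2_dim = 20, 6, 400, 300
rng = np.random.default_rng(0)
params = (
    rng.normal(size=(obs_dim, h1_dim)) * 0.01, np.zeros(h1_dim),
    rng.normal(size=(h1_dim + act_dim, h2_dim)) * 0.01, np.zeros(h2_dim),
    rng.normal(size=(h2_dim, 1)) * 0.01, np.zeros(1),
)
print(critic_q_value(rng.normal(size=obs_dim), rng.normal(size=act_dim), params))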

cannontwo commented 7 years ago

I'm not sure if there are additional parameters implied by your "etc" that are especially important to you, but epoch_length, max_path_length, and n_epochs should have minimal effect on training your RL agent. For DDPG I typically use ~5000 epochs by default, with an epoch_length/max_path_length of 1000, but you will likely want to adjust this to better fit the precise environments that you are working with.

You may notice that in the paper at https://arxiv.org/pdf/1509.02971.pdf the authors report their results in terms of the number of steps of training, which is the important metric for training results. So long as your epochs are not trivially short, you should have a wide margin for the specific parameters (epoch_length, max_path_length, n_epochs) that you mention.

Much more relevant to replicating the authors' results are the parameters to the actor and critic networks and discount values for reinforcement learning, which are reported in the paper linked above.
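
As a rough sanity check of that step accounting (a trivial sketch; epoch_length here counts environment steps per epoch, which is confirmed later in this thread):

# Back-of-the-envelope step accounting for the defaults mentioned above.
n_epochs = 5000       # "~5000 epochs by default"
epoch_length = 1000   # environment steps collected per epoch
total_env_steps = n_epochs * epoch_length
print(total_env_steps)  # 5,000,000 -- compare against the step axis in the DDPG paper's figures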

anaypat commented 7 years ago

This is the code that I used for the half cheetah task:

from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction

def run_task(*_):
    env = normalize(HalfCheetahEnv())

    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers.
        hidden_sizes=(400, 300)
    )

    es = OUStrategy(env_spec=env.spec)

    qf = ContinuousMLPQFunction(env_spec=env.spec)

    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
) 

[image: progress_half_cheetah]

This is the average return per epoch graph. It seems like it doesn't learn.

As you can see, I ran it for around 1600 epochs with 1000 steps per epoch, which is around 1.6 million steps. In https://arxiv.org/abs/1509.02971, Fig. 2, the agent learns well before 1 million steps.

Can you please suggest what might be the issue? It seems that I need to change some parameters. As you suggested earlier, I have kept the architecture, parameters, and update rule the same as in https://arxiv.org/abs/1509.02971 (btw, this was the default setting).

dementrock commented 7 years ago

Can you try setting the hidden_sizes of ContinuousMLPQFunction to also (400, 300)?

anaypat commented 7 years ago

Thanks for pointing it out, @dementrock!

[image: it_works]

anaypat commented 7 years ago

Just out of curiosity: the results in https://arxiv.org/abs/1509.02971 (Figure 2) show that the algorithm converges well before 1 million steps. In the above experiment I used max_path_length=500, epoch_length=1000, and n_epochs=3000, and it seems to converge at around 2000 epochs. As I understand it, that works out to 500 * 1000 * 2000 = 100 million steps. Can I reduce this by setting some of these parameters (max_path_length, epoch_length, n_epochs) appropriately? I'm asking because it would save me training time on other tasks as well. Please let me know if this is task dependent too. The code I used to produce the result in the previous comment is given below.

from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction

def run_task(*_):
    env = normalize(HalfCheetahEnv())

    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers
        hidden_sizes=(400, 300)
    )

    es = OUStrategy(env_spec=env.spec)                  

    qf = ContinuousMLPQFunction(
        env_spec=env.spec,
        hidden_sizes=(400, 300)
    )

    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
)

dementrock commented 7 years ago

epoch_length is the number of time steps per epoch rather than the number of episodes. So what you have should correspond to 1000 * 2000 = 2 million time steps. Also in the DDPG paper I think they might have used a shorter horizon, probably about 250 time steps per episode.
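
For example, if you want to try the shorter horizon, only the max_path_length argument needs to change; a sketch below reuses env, policy, es, and qf as defined in the script above, with all other values as posted there:

    # Drop-in replacement for the DDPG(...) call in the script above, with a shorter horizon.
    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=250,   # shorter episodes, closer to the horizon suggested above
        epoch_length=1000,     # environment time steps per epoch, not episodes
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
    )
    algo.train()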

anaypat commented 7 years ago

@dementrock Thanks for the clarification. I'll try out shorter horizon.

atavakol commented 7 years ago

How about qf_weight_decay? By default it's set to zero, but in the DDPG paper it's said to be 1e-2. Is there a reason for this? Same question with regard to reward scaling: do we need to use reward scaling for DDPG in rllab, and if so, what value works well? The DDPG paper doesn't mention reward scaling.

dementrock commented 7 years ago

I've found weight decay to hurt performance sometimes, but you should experiment with both. For reward scaling use 0.1.

There's no mention in the DDPG paper because they implemented the environments themselves and could choose to scale the reward when defining them. However, the environments in rllab were implemented for policy gradient algorithms, which are batch based and can already normalize rewards within each batch.
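
For concreteness, a sketch of the two knobs discussed here in the DDPG call from the script earlier in the thread (qf_weight_decay defaults to 0; the values below are only the ones mentioned in this discussion):

    # Sketch: reward scaling and critic weight decay in the DDPG constructor used above.
    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        scale_reward=0.1,      # multiply rewards by 0.1, as recommended above
        qf_weight_decay=1e-2,  # L2 penalty on the critic (the DDPG paper's value); 0.0 disables it
        # ... keep the remaining arguments as in the earlier script ...
    )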

atavakol commented 7 years ago

@dementrock I tried reward scaling of 0.1 and 1.0. With 0.1, I get divergence or plateauing at very bad returns for Reacher and Hopper. With 1.0 I was getting better results for both, but for Reacher the agent would start at evaluation rewards of -12 and plateau at -10 or -9, which is far from solved. Any pointers? I'm using all the parameters from the DDPG paper.

ghost commented 6 years ago

@dementrock I found that using batch normalization in the policy and value networks hinders performance on all the tasks I tried (Half Cheetah, Swimmer, Reacher, and Walker2D), contrary to what the DeepMind paper says. Any insights on why this is the case?

Another question I had was about include_horizon_terminal_transitions. It defaults to False in rllab, but I think DeepMind generally uses True (at least in Atari environments)?

dementrock commented 6 years ago

@aravindsrinivas I've observed the same behavior when using batch norm. I haven't figured out why, but one reason may be that the environments they used aren't exactly the same, and their environments may require more care with normalizing activations (e.g. if the inputs are in different ranges). Also, different parameterizations of the batch norm parameters will yield different behaviors when doing the soft target update (e.g. parameterizing the variance vs. the inverse of the variance).
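
To illustrate that last point (a toy sketch, not rllab code): Polyak-averaging a variance directly is not the same as averaging its inverse and inverting back, so how the batch norm statistics are parameterized changes what the soft target update does.

def soft_update(target, source, tau=0.001):
    # Standard DDPG soft target update: target <- (1 - tau) * target + tau * source
    return (1.0 - tau) * target + tau * source

var_target, var_source = 1.0, 4.0

v_direct = soft_update(var_target, var_source)                         # average the variance
v_via_inverse = 1.0 / soft_update(1.0 / var_target, 1.0 / var_source)  # average 1/variance, invert

print(v_direct, v_via_inverse)  # 1.003 vs. ~1.0008 -- the two parameterizations drift apart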

include_horizon_terminal_transitions: I haven't found this to matter. When I communicated with Tim Lillicrap (first author of the DDPG paper) earlier, he indicated that they did not include such transitions.

LilianaNYC commented 5 years ago

> Thanks for pointing it out, @dementrock!
>
> [image: it_works]

Hi anaypat,

I am working on something very similar but with a different environment. I was just curious how you were able to plot the average reward after training the algo. If you could point me in the right direction, I would appreciate it.

Thanks!
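
One common approach, assuming rllab's default CSV logging (the path below is hypothetical, and the exact column name, e.g. AverageReturn, may differ by version; check your own log directory):

# Hedged sketch: plot per-epoch average return from an rllab progress.csv log.
# The log path is hypothetical; the "AverageReturn" column name may vary by rllab version.
import csv
import matplotlib.pyplot as plt

returns = []
with open("data/local/experiment/experiment_2019_01_01/progress.csv") as f:  # hypothetical path
    for row in csv.DictReader(f):
        returns.append(float(row["AverageReturn"]))

plt.plot(returns)
plt.xlabel("Epoch")
plt.ylabel("Average return")
plt.show()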

DongChen06 commented 5 years ago

@LilianaNYC Have you figured out how to solve this? By averaging over random seeds?