anaypat opened this issue 7 years ago
Check here: https://arxiv.org/abs/1509.02971
Thanks! However, I couldn't find some parameters such as epoch_length, max_path_length, n_epochs, etc.
It is mentioned in https://arxiv.org/abs/1509.02971 (Section 7) that "Actions were not included until the 2nd hidden layer of Q". In rllab's DeterministicMLPPolicy class, I couldn't find where the actions are injected into the Q network.
It seems that the critic network architecture is different. Am I missing something?
I'm not sure what you mean by referring to the DeterministicMLPPolicy, as the code implementing DDPG is here: https://github.com/openai/rllab/blob/master/rllab/algos/ddpg.py. Hopefully that helps.
Sorry, I previously referred to the actor network instead of the critic network. I should have mentioned https://github.com/openai/rllab/blob/master/rllab/q_functions/continuous_mlp_q_function.py. rllab does indeed merge the action into the second hidden layer of the critic.
By the way, I still couldn't find epoch_length, max_path_length, n_epochs, etc.
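For anyone following along, here is a minimal NumPy sketch (not the actual rllab/Lasagne implementation) of the critic structure described above, where the action bypasses the first hidden layer and is concatenated into the second; the layer sizes follow the paper, and the random weights are purely illustrative:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def make_critic(obs_dim, act_dim, h1=400, h2=300, seed=0):
    rng = np.random.default_rng(seed)
    # Random weights purely for illustration; the paper uses specific initializers.
    W1 = rng.normal(scale=0.1, size=(obs_dim, h1))
    b1 = np.zeros(h1)
    # The second layer's input is the first hidden layer concatenated with the action.
    W2 = rng.normal(scale=0.1, size=(h1 + act_dim, h2))
    b2 = np.zeros(h2)
    W3 = rng.normal(scale=0.1, size=(h2, 1))
    b3 = np.zeros(1)

    def q_value(obs, action):
        h = relu(obs @ W1 + b1)                          # first hidden layer: state only
        h = relu(np.concatenate([h, action]) @ W2 + b2)  # action enters here
        return (h @ W3 + b3)[0]                          # scalar Q(s, a)

    return q_value

q = make_critic(obs_dim=20, act_dim=6)
print(q(np.zeros(20), np.zeros(6)))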
I'm not sure if there are additional parameters implied by your "etc" that are especially important to you, but epoch_length, max_path_length, and n_epochs should have minimal effect on training your RL agent. For DDPG I typically use ~5000 epochs by default, with an epoch_length/max_path_length of 1000, but you will likely want to adjust this to better fit the precise environments that you are working with.
You may notice that in the paper at https://arxiv.org/pdf/1509.02971.pdf the authors report their results in terms of the number of steps of training, which is the important metric for training results. So long as your epochs are not trivially short, you should have a wide margin for the specific parameters (epoch_length, max_path_length, n_epochs) that you mention.
Much more relevant to replicating the authors' results are the actor and critic network parameters and the discount factor, which are reported in the paper linked above.
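For convenience, here is an unofficial quick reference of the values the paper reports in Section 7, written as a plain Python dict; double-check them against the paper, and note that the comments only map to rllab kwargs where the correspondence is obvious:

ddpg_paper_hparams = {
    "actor_lr": 1e-4,                # policy_learning_rate
    "critic_lr": 1e-3,               # qf_learning_rate
    "critic_l2_weight_decay": 1e-2,  # qf_weight_decay
    "discount": 0.99,                # discount
    "soft_target_tau": 1e-3,
    "batch_size": 64,                # batch_size (low-dimensional runs)
    "replay_buffer_size": int(1e6),
    "hidden_sizes": (400, 300),      # actor and critic hidden layers
    "ou_theta": 0.15,                # Ornstein-Uhlenbeck exploration noise
    "ou_sigma": 0.2,
}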
This is the code that I used for the half cheetah task:
from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction
def run_task(*_):
    env = normalize(HalfCheetahEnv())
    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers.
        hidden_sizes=(400, 300)
    )
    es = OUStrategy(env_spec=env.spec)
    qf = ContinuousMLPQFunction(env_spec=env.spec)
    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
)
This is the average return per epoch graph. It seems like it doesn't learn.
As you can see, I ran it for around 1600 epochs with 1000 steps per epoch, which is around 1.6 million steps. In https://arxiv.org/abs/1509.02971, Fig. 2, the agent learns well before 1 million steps.
Can you please suggest what might be the issue? It seems that I need to change some parameters. As you suggested earlier, I have kept the architecture, parameters, and update rule the same as in https://arxiv.org/abs/1509.02971 (this was the default setting anyway).
Can you try setting the hidden_sizes of ContinuousMLPQFunction to (400, 300) as well?
Thanks for pointing it out, @dementrock !
Just out of curiosity: the results given in https://arxiv.org/abs/1509.02971 (Figure 2) show that the algorithm converges well before 1 million steps. In the above experiment, I used max_path_length=500, epoch_length=1000, and n_epochs=3000, and it seems to converge at around 2000 epochs. As I understand it, that works out to 500 * 1000 * 2000 = 1 billion steps. Can I reduce this by appropriately setting some of these parameters (max_path_length, epoch_length, n_epochs)? I'm asking because it would save me training time on other tasks as well. Please let me know if it is task dependent too. Given below is the code that I used to produce the result in the previous comment.
from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction
def run_task(*_):
    env = normalize(HalfCheetahEnv())
    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers
        hidden_sizes=(400, 300)
    )
    es = OUStrategy(env_spec=env.spec)
    qf = ContinuousMLPQFunction(
        env_spec=env.spec,
        hidden_sizes=(400, 300)
    )
    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
)
epoch_length is the number of time steps per epoch rather than the number of episodes. So what you have should correspond to 1000 * 2000 = 2 million time steps. Also in the DDPG paper I think they might have used a shorter horizon, probably about 250 time steps per episode.
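In other words (a quick sanity check of the arithmetic, assuming epoch_length counts environment time steps per epoch):

epoch_length = 1000    # environment time steps collected per epoch
n_epochs = 2000        # roughly where the run above had converged
max_path_length = 500  # only caps episode length; it does not multiply into the total
total_steps = epoch_length * n_epochs
print(total_steps)     # 2000000, i.e. 2 million time steps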
@dementrock Thanks for the clarification. I'll try out shorter horizon.
How about qf_weight_decay? By default it's set to zero, but in the DDPG paper it's said to be 1e-2. Is there a reason for this? Same question regarding reward scaling: do we need to use reward scaling for DDPG in rllab, and if so, what value works well? There is no mention of reward scaling in the DDPG paper.
I've found weight decay to hurt performance sometimes, but you should experiment with both. For reward scaling use 0.1.
There's no mention in the DDPG paper because they implemented the environments themselves, so they could choose to scale the reward when defining the environments. The environments in rllab, however, were implemented for policy gradient algorithms, which are batch based and can already normalize rewards within each batch.
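To make the effect of scale_reward concrete, a rough sketch (the exact point where rllab applies the scaling, e.g. when samples are added to the replay pool, may differ):

scale_reward = 0.1

def scaled_reward(reward, scale=scale_reward):
    # the critic's Bellman target is then built from the scaled reward:
    #   y = scale * r + discount * Q_target(s', pi_target(s'))
    return scale * reward

print(scaled_reward(5.0))  # -> 0.5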
@dementrock I tried reward scaling of 0.1 and 1.0. With 0.1, for Reacher and Hopper I get divergence or plateauing at very bad returns. With 1.0 I was getting better results for both, but for Reacher the agent would start at evaluation rewards of about -12 and plateau at -10 or -9, which is far from solved. Any pointers? I'm using all the parameters from the DDPG paper.
@dementrock I found that using batch normalization in the policy and value networks hinders performance on all of the tasks I tried (Half Cheetah, Swimmer, Reacher, and Walker2D), contrary to what the DeepMind paper reports. Any insights into why that might be the case?
Another question I had is about include_horizon_terminal_transitions. It defaults to False in rllab, but I think DeepMind generally sets it to True (at least in the Atari environments)?
@aravindsrinivas I've observed the same behavior when using batch norm. I haven't figured out why, but maybe one reason is that the environments they used aren't exactly the same, and perhaps theirs require more care with normalizing activations (e.g. if the inputs are in different ranges). Also, different parameterizations of the batch norm parameters will yield different behaviors under the soft target update (e.g. parameterizing the variance vs. the inverse of the variance).
include_horizon_terminal_transitions: I haven't found this to matter. Earlier, when communicating with Tim Lillicrap (first author of the DDPG paper), he indicated that they did not include such transitions.
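As a side note for readers, the "soft target update" mentioned above is the slow tracking update from the DDPG paper (tau = 0.001 there); a plain-Python sketch of the idea:

def soft_update(target_params, source_params, tau=0.001):
    # each target parameter slowly tracks its online counterpart
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]

print(soft_update([0.0, 1.0], [1.0, 1.0]))  # -> [0.001, 1.0]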
Thanks for pointing it out, @dementrock !
Hi anaypat,
I am working on something very similar but in a different environment. I was just curious how you were able to plot the average return after training the algorithm. If you could point me in the right direction, I would appreciate it.
Thanks!
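Not an official rllab utility, but one way to approach it, assuming the experiment's log directory contains the progress.csv that rllab's logger writes (the path below is hypothetical, and the "AverageReturn" column name may vary by algorithm/version, so check the CSV header first):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/local/experiment/my_ddpg_run/progress.csv")  # hypothetical path
print(df.columns)              # inspect the actual column names first
plt.plot(df["AverageReturn"])  # assumed column name; the x-axis is the epoch index
plt.xlabel("Epoch")
plt.ylabel("Average return")
plt.show()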
@LilianaNYC Have you figured out how to solve this? By averaging over random seeds?
Can you please let us know the parameters used for the DDPG algorithm to reproduce the results given in the "Benchmarking Deep Reinforcement Learning for Continuous Control" paper (https://arxiv.org/pdf/1604.06778.pdf)? I couldn't find them in the supplementary material.