rlworkgroup / garage

A toolkit for reproducible reinforcement learning research.

Average return very low in tf.DDPG #1077

Open surbhi1944 opened 4 years ago

surbhi1944 commented 4 years ago

Formula for off_policy_method:

total_timesteps = n_epochs * n_epoch_cycles * batch_size

So if n_epochs=1400, n_epoch_cycles=20, batch_size=64, and min_buffer_size=10^6, then total_timesteps = 1400 * 20 * 64 = 1,792,000.
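
Written out as a quick sanity check (plain Python, just mirroring the parameter names above):

n_epochs = 1400
n_epoch_cycles = 20
batch_size = 64

total_timesteps = n_epochs * n_epoch_cycles * batch_size
print(total_timesteps)  # 1792000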

I obtained the graph shown in the figure for DDPG on Walker2d-v3. It shows a very low average return, but most research papers report an average return of ~2500 at 1 million timesteps. How should I set the parameters to get close to those results? My code:

import gym
import tensorflow as tf
import time
from garage.experiment import run_experiment
from garage.np.exploration_strategies import OUStrategy
from garage.replay_buffer import SimpleReplayBuffer
from garage.tf.algos import DDPG
from garage.tf.envs import TfEnv
from garage.tf.experiment import LocalTFRunner
from garage.tf.policies import ContinuousMLPPolicy
from garage.tf.q_functions import ContinuousMLPQFunction
import random
from datetime import datetime, timedelta
import numpy as np
import os

def run_task(snapshot_config, *_):
    """Run task."""

    with LocalTFRunner(snapshot_config=snapshot_config) as runner:
        env=gym.make('Walker2d-v3')
        env = TfEnv(env)
        action_noise = OUStrategy(env.spec, sigma=0.2)

        policy = ContinuousMLPPolicy(env_spec=env.spec,
                                     hidden_sizes=[400, 300],
                                     hidden_nonlinearity=tf.nn.relu,
                                     output_nonlinearity=tf.nn.tanh)

        qf = ContinuousMLPQFunction(env_spec=env.spec,
                                    hidden_sizes=[400,300],
                                    hidden_nonlinearity=tf.nn.relu)

        replay_buffer = SimpleReplayBuffer(env_spec=env.spec,
                                           size_in_transitions=int(1e6),
                                           time_horizon=100)

        ddpg = DDPG(env_spec=env.spec,
                    policy=policy,
                    policy_lr=1e-4,
                    qf_lr=1e-3,
                    qf=qf,
                    replay_buffer=replay_buffer,
                    target_update_tau=1e-3,
                    n_train_steps=50,
                    discount=0.99,
                    buffer_batch_size=64,
                    n_epoch_cycles=20,
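                    # note: optimization does not begin until the buffer holds min_buffer_size transitions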
                    min_buffer_size=int(1e6),
                    exploration_strategy=action_noise,
                    policy_optimizer=tf.train.AdamOptimizer,
                    qf_weight_decay=0.01,
                    qf_optimizer=tf.train.AdamOptimizer)

        runner.setup(algo=ddpg, env=env)

        runner.train(n_epochs=2000, n_epoch_cycles=20, batch_size=64)

sed=[21]
for difsed in range(1):
    i=0
    seed=sed[difsed]
    #for i in range(2):
    start_time=time.time()
    run_experiment(
        run_task,
        snapshot_mode='last',
        seed=seed,
        exp_name=str(seed)+"_"+str(i),
        log_dir=r"/home/surabhi/Downloads/github/garage/result/ddpg/walk-v22/"+str(seed)+"/"+str(i)+"/"
    )
        #print("Time: ",timedelta(seconds=time.time()-start_time))
    file=open(r"/home/surabhi/Downloads/github/garage/result/ddpg/walk-v22/time.txt","a")
    file.write('seed '+str(seed)+' itr '+str(i)+' start '+str(start_time)+' elapsed '+str(timedelta(seconds=time.time()-start_time))+"\n")
    file.close()


I found that it only starts evaluation from epoch 782 (because 1000000 // (20*64) ≈ 781). Hence the condition on line 272 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py will be true from this epoch, and policy optimization will only start from that point onward. Is this the reason for not getting good results?
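
Spelled out (just arithmetic, not garage code), that works out to:

min_buffer_size = int(1e6)
steps_per_epoch = 20 * 64                          # n_epoch_cycles * batch_size = 1280
epochs_to_fill_buffer = min_buffer_size // steps_per_epoch
print(epochs_to_fill_buffer)  # 781, so optimization only kicks in around epoch 782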

Another question I want to ask: why is the evaluation loop (line 271 of ddpg.py) repeated n_train_steps times? What is the purpose of evaluating n_train_steps times? Is this doing a kind of rollout repeated n_train_steps times, where the length of a rollout is either the end of an episode or a trajectory of length batch_size=64 (line 173 of https://github.com/rlworkgroup/garage/blob/master/src/garage/sampler/off_policy_vectorized_sampler.py#L66)?

krzentner commented 4 years ago

Hi Surbhi1944, thanks for opening this issue.

Optimization starts at epoch 782 because you've set min_buffer_size to 1000000. Usually, when people benchmark this task, they set this parameter much lower. For example, our benchmarks set it to 10000 for all MuJoCo tasks. I believe that this is why you're seeing such low performance.
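
Concretely, the only change to the DDPG construction in your script would be the min_buffer_size argument (a sketch, reusing the policy, qf, replay_buffer and action_noise already defined in your script; everything else stays the same):

        ddpg = DDPG(env_spec=env.spec,
                    policy=policy,
                    policy_lr=1e-4,
                    qf_lr=1e-3,
                    qf=qf,
                    replay_buffer=replay_buffer,
                    target_update_tau=1e-3,
                    n_train_steps=50,
                    discount=0.99,
                    buffer_batch_size=64,
                    n_epoch_cycles=20,
                    min_buffer_size=int(1e4),  # lowered from int(1e6), as suggested above
                    exploration_strategy=action_noise,
                    policy_optimizer=tf.train.AdamOptimizer,
                    qf_weight_decay=0.01,
                    qf_optimizer=tf.train.AdamOptimizer)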

About your other question: n_train_steps is a parameter we use to make our epoch size match other implementations, and we are working on removing it. Soon, we will be logging performance based on the number of time steps, which should make it easier to compare performance.

I do believe that our implementations of DDPG should perform much better than this, as indicated by this benchmark result below. If you find that it doesn't after changing min_buffer_size, then I can look into it further. By the way, in which papers do you see DDPG get an average return of 2500 after 1M time steps on Walker2d? At least in Soft Actor-Critic Algorithms and Applications and in Deep Reinforcement Learning that Matters, the expected average return is a little over 1000.

Hopefully that answers your questions, but please let me know if you have any others or there was something I missed.

surbhi1944 commented 4 years ago

Thanks for the reply.

Now I understand the purpose of the min_buffer_size variable, but I'm still confused about n_train_steps. I think it represents the number of times we want the weights of the neural network to be updated (one update per batch): the larger its value, the more times optimize_policy (line 275 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py) will be called, and hence the more times the weights will be updated. Please correct me if I am wrong.

If we want to train on an environment such as Humanoid (which needs ~10 million timesteps), should we increase n_train_steps, or only n_epochs and n_epoch_cycles?
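
To check my understanding, here is a toy counting sketch of how I picture these parameters interacting (my assumption, not garage's actual loop; it only counts steps and updates):

# Toy sketch: count environment steps vs. gradient updates for the settings above.
n_epochs, n_epoch_cycles, batch_size = 2000, 20, 64
n_train_steps, min_buffer_size = 50, int(1e6)

env_steps = 0
grad_updates = 0
for _ in range(n_epochs):
    for _ in range(n_epoch_cycles):
        env_steps += batch_size            # one sampler rollout of batch_size steps per cycle
        if env_steps >= min_buffer_size:   # optimization gated on the buffer being filled
            grad_updates += n_train_steps  # n_train_steps mini-batch updates per cycle

print(env_steps)     # 2560000: grows with n_epochs * n_epoch_cycles * batch_size only
print(grad_updates)  # grows with n_train_steps, without adding any environment steps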

I saw:

1) ~2000 average return for DDPG on Walker2d at 1 million timesteps in the research paper "Addressing Function Approximation Error in Actor-Critic Methods".

2) ~1800 average episodic return at 1 million timesteps in the research paper "Mutual-Information Regularization in Markov Decision Processes and Actor-Critic Learning".

3) ~1800 average return at 1 million timesteps in the benchmark at the link below: https://spinningup.openai.com/en/latest/spinningup/bench.html#id10

[garage DDPG Walker2d benchmark plot]

The results in your benchmark graph (above) also do not come anywhere close to those numbers and converge at ~220 (garage_tf_trial1_seed30). Is there a way to compare your ~220 to the ~1000 reported by others? In other words, what is reported in that graph: the return of a single episode, the return over multiple batches, the average return over the previous 100 episodes, the average return over 1000 timesteps, or something else?

avnishn commented 4 years ago

Hello all, it seems that we have an issue in the way that we log average returns over time. The issue seems to be over here:

https://github.com/rlworkgroup/garage/blob/1def65424ba67988f1d7fe7a03bc6d5ec8e80eef/src/garage/torch/algos/ddpg.py#L144

Essentially, we're computing an average over all the returns that our sampler has ever observed from rolling out the policy. We should only be calculating average returns over a window of recent training epochs (e.g. the last 30 or 100, not all 500-1000).
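
A minimal sketch of the kind of bounded window we want instead (illustrative only, not the actual patch; the helper names and the window of 100 episodes are just examples):

from collections import deque

# Keep only the most recent episode returns instead of every return ever observed.
recent_returns = deque(maxlen=100)

def record_episode(undiscounted_return):
    """Store one finished episode's return; old entries fall out automatically."""
    recent_returns.append(undiscounted_return)

def average_return():
    """Average over the last (up to) 100 episodes rather than the full history."""
    return sum(recent_returns) / len(recent_returns) if recent_returns else 0.0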

We'll make a fix and re-run our baselines to verify that this is the case. Thank you @surbhi1944.

ryanjulian commented 4 years ago

Quick update -- we were able to confirm your report and found lackluster performance in our tf/DDPG implementation. We are now auditing our implementation and making this fix the highest priority.

We're also benchmarking our torch/DDPG implementation to confirm whether or not the bug is shared.

We'll keep this issue updated.