rail-berkeley / rlkit

Collection of reinforcement learning algorithms
MIT License

Some benchmarks on six MuJoCo-v2 environments for DDPG and TD3 #63

Open DanielTakeshi opened 5 years ago

DanielTakeshi commented 5 years ago

Hi @vitchyr

Thanks for the great code base. I was recently benchmarking some results here while searching for DDPG/TD3 implementations, after failing to get baselines working. I thought I'd share some results in case they are useful to you or others.

For installation, I didn't entirely follow the installation instructions; here's what I did:

  • I used a Python 3.6.7 pip virtualenv, and just manually installed the packages I saw in your installation yml file. I used torch 0.4.1 as recommended.
  • I used MuJoCo 2.0, so I was running the -v2 instances of the environments.
  • I used gym 0.12.5 and mujoco-py 2.0.2.2.

I took the master branch from https://github.com/vitchyr/rlkit/commit/5565dd589c54f3ee5add28183dd28f0e9663130f and then adjusted examples/td3.py and examples/ddpg.py so that they also imported other MuJoCo environments. In addition, for TD3 only, I adjusted the hyperparameters in "algorithm_kwargs" so that they matched DDPG in the main method. To be clear, DDPG uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L71-L79

And TD3 uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/td3.py#L104-L111

I simply modified the td3.py script so that all of the hyperparameters above match DDPG; in particular, I changed the number of epochs to 1000, eval steps to 1000, min steps before training to 10k, and the batch size to 128.
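
For reference, the change amounts to something like the following; the key names follow the rlkit examples at that commit as far as I recall (and td3_overrides is just a name for this sketch), so see the linked lines for the authoritative names and values:

# Sketch (key names approximate): the values set in examples/td3.py so that
# its algorithm_kwargs match examples/ddpg.py.
td3_overrides = dict(
    num_epochs=1000,
    num_eval_steps_per_epoch=1000,
    min_num_steps_before_training=10_000,
    batch_size=128,
)
# 1000 epochs x 1000 steps per epoch = 1,000,000 steps for each policy.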

If I am not mistaken, this means that both the exploration and evaluation policies experience 1 million total steps over the course of training. However, because evaluation by default discards incomplete trajectories, the actual number of evaluation steps reported in the logs can be slightly less than 1 million.

I ran DDPG and TD3 on six MuJoCo-v2 environments, for four random seeds each. I adjusted the code so my directory structure looks like this:

$ ls -lh data/
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Walker2d-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Walker2d-v2
$ ls -lh data/rlkit-ddpg-Ant-v2/
drwxrwxr-x 2 daniel daniel 4.0K Jun 20 20:49 rlkit-ddpg-Ant-v2_2019_06_20_20_49_44_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_20_53_49_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_44_22_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_49_37_0000--s-0
$ 

// other env results presented in a similar manner

For plotting, I used the following script, which I call like python [script].py Ant-v2 (and similarly for the other environments):

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
import argparse
import csv
import pandas as pd
import os
import numpy as np
from os.path import join

# matplotlib
titlesize = 33
xsize = 30
ysize = 30
ticksize = 25
legendsize = 25
error_region_alpha = 0.25

def smoothed(x, w):
    """Smooth x by averaging over sliding windows of w, assuming sufficient length.
    """
    if len(x) <= w:
        return x
    smooth = []
    for i in range(1, w):
        smooth.append( np.mean(x[0:i]) )
    for i in range(w, len(x)+1):
        smooth.append( np.mean(x[i-w:i]) )
    assert len(x) == len(smooth), "lengths: {}, {}".format(len(x), len(smooth))
    return np.array(smooth)

def plot(args):
    """Load the progress csv file, and plot.

    Plot:
      'exploration/Returns Mean',
      'exploration/num steps total',
      'evaluation/Returns Mean',
      'evaluation/num steps total',
    """
    nrows, ncols = 1, 2
    fig, ax = plt.subplots(nrows, ncols, squeeze=False, sharey='row',
                           figsize=(11*ncols,6*nrows))

    algorithms = sorted([x for x in os.listdir('data/') if args.env in x])
    assert len(algorithms) == 2
    colors = ['blue', 'red']

    for idx,alg in enumerate(algorithms):
        print('Currently on algorithm: ', alg)
        alg_dir = join('data', alg)
        progfiles = sorted([
                join(alg_dir, x, 'progress.csv') for x in os.listdir(alg_dir)
        ])
        expl_returns = []
        eval_returns = []
        expl_steps = []
        eval_steps = []

        for prog in progfiles:
            df = pd.read_csv(prog, delimiter = ',')

            expl_ret = df['exploration/Returns Mean'].tolist()
            expl_returns.append(expl_ret)
            eval_ret = df['evaluation/Returns Mean'].tolist()
            eval_returns.append(eval_ret)

            expl_sp = df['exploration/num steps total'].tolist()
            expl_steps.append(expl_sp)
            eval_sp = df['evaluation/num steps total'].tolist()
            eval_steps.append(eval_sp)

        expl_returns = np.array(expl_returns)
        eval_returns = np.array(eval_returns)
        xs = expl_returns.shape[1]
        expl_ret_mean = np.mean(expl_returns, axis=0)
        eval_ret_mean = np.mean(eval_returns, axis=0)
        expl_ret_std = np.std(expl_returns, axis=0)
        eval_ret_std = np.std(eval_returns, axis=0)

        w = 10
        label0 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(expl_ret_mean[-w:]))
        label1 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(eval_ret_mean[-w:]))
        ax[0,0].plot(np.arange(xs), smoothed(expl_ret_mean, w=w),
                     color=colors[idx], label=label0)
        ax[0,1].plot(np.arange(xs), smoothed(eval_ret_mean, w=w),
                     color=colors[idx], label=label1)

        # This can be noisy.
        if False:
            ax[0,0].fill_between(np.arange(xs),
                                 expl_ret_mean-expl_ret_std,
                                 expl_ret_mean+expl_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])
            ax[0,1].fill_between(np.arange(xs),
                                 eval_ret_mean-eval_ret_std,
                                 eval_ret_mean+eval_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])

    for i in range(2):
        ax[0,i].tick_params(axis='x', labelsize=ticksize)
        ax[0,i].tick_params(axis='y', labelsize=ticksize)
        leg = ax[0,i].legend(loc="best", ncol=1, prop={'size':legendsize})
        for legobj in leg.legendHandles:
            legobj.set_linewidth(5.0)
    ax[0,0].set_title('{} (Exploration)'.format(args.env), fontsize=ysize)
    ax[0,1].set_title('{} (Evaluation)'.format(args.env), fontsize=ysize)

    plt.tight_layout()
    figname = 'fig-{}.png'.format(args.env)
    plt.savefig(figname)
    print("\nJust saved: {}".format(figname))

if __name__ == "__main__":
    pp = argparse.ArgumentParser()
    pp.add_argument('env', type=str)
    args = pp.parse_args()
    plot(args)

Here are the curves. Left is the exploration policy, and right is the evaluation policy.

[Figures: fig-Ant-v2, fig-HalfCheetah-v2, fig-Hopper-v2, fig-InvertedPendulum-v2, fig-Reacher-v2, fig-Walker2d-v2]

The TL;DR is that TD3 wins on four of the environments, and DDPG wins on the other two. One of the environments TD3 doesn't win is InvertedPendulum, but that should be easy to get to 1000 if the hyperparameters are tuned. Also, to reiterate the code comments, I do not report standard deviations since that would make the plots quite hard to read.

I thought this might be useful, if you want to point people towards some baselines. (I didn't see any upon a quick glance, but maybe you have them somewhere else?) Anyway, I hope this is useful or at least remotely interesting!

DanielTakeshi commented 5 years ago

One more thing: the example scripts have code like this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L22-L24

and we are using Tanh policies:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L35-L39

Just wondering, is the NormalizedBoxEnv needed in this case? Perhaps it was just added to let us know what we could do with it later? By default it seems like we are not normalizing observations or returns, so NormalizedBoxEnv would only serve to clip each action component to [-1, 1]. But the tanh will naturally force actions into that range anyway.

The only other possibility I can think of for the NormalizedBoxEnv is if the extra noise injected into the exploration policy causes some action components to exceed the [-1, 1] range. But inserting some print and assertion checks into the NormalizedBoxEnv step method and running python examples/ddpg.py shows that no actions fall outside the range, so presumably the action-plus-noise for exploration is clipped somewhere before that.
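
For concreteness, the check was along these lines (a rough sketch using a generic gym.Wrapper rather than rlkit's actual NormalizedBoxEnv, so the class name here is mine):

import gym
import numpy as np

class ActionRangeCheck(gym.Wrapper):
    """Hypothetical wrapper mimicking the assertion inserted into the
    NormalizedBoxEnv step method: flag any action component outside [-1, 1]."""

    def step(self, action):
        action = np.asarray(action)
        assert np.all(action >= -1.0) and np.all(action <= 1.0), \
            "action outside [-1, 1]: {}".format(action)
        return self.env.step(action)

# Usage, e.g.: env = ActionRangeCheck(gym.make('Ant-v2')). The assertion
# never fired, so the exploration noise is evidently clipped upstream.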

vitchyr commented 5 years ago

Thanks for this! I'll work on incorporating this into the documentation later.

The NormalizedBoxEnv is there so that the env expects actions in [-1, 1]. I think this already happens by default for the gym envs, but if the native action range is actually [-2, 2], then this will rescale the actions accordingly. Like you said, another use case is clipping the noise. Frankly, I bet you could remove it without affecting performance too much, but I haven't tried.
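
A minimal sketch of that rescaling idea (not the exact NormalizedBoxEnv code; rescale_action is just a name for this sketch):

import numpy as np

def rescale_action(action, lb, ub):
    """Map a policy action in [-1, 1] onto the env's native action box
    [lb, ub], e.g. [-2, 2], clipping first for safety."""
    action = np.clip(np.asarray(action), -1.0, 1.0)
    return lb + (action + 1.0) * 0.5 * (ub - lb)

# Example: with lb=-2 and ub=2, a policy action of 0.5 becomes
# -2 + 1.5 * 0.5 * 4 = 1.0 in the env's native units.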


ZhenhuiTang commented 1 year ago

Hi, I was wondering what the difference is between the exploration policy and the evaluation policy? Which one is commonly reported in RL papers? For example, are the training curves in the SAC paper based on the exploration policy, which corresponds to 'expl/Average Returns'? And why do the returns from the evaluation policy tend to be better than those from the exploration policy?

I really look forward to your reply!
