hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

The stable-baselines implementation of TD3 cannot achieve the same performance as the original TD3 [question] #840

Closed jeppelangaa closed 4 years ago

jeppelangaa commented 4 years ago

I wanted to use the stable-baselines implementation of TD3 so that I can more easily compare it to other reinforcement learning algorithms.

I have compared the original implementation of TD3 from sfujim to the one from stable-baselines. The stable-baselines version is faster in terms of computation time, but it does not reach results as good as the original implementation. I have created a custom policy with two hidden layers of 256 units and set the hyperparameters according to the original implementation, as shown below:

import peg_in_hole_env  # custom environment package (registers PegInHole-v0)
import gym
import numpy as np

from stable_baselines import TD3
from stable_baselines.td3.policies import FeedForwardPolicy
from stable_baselines.ddpg.noise import NormalActionNoise

# Custom MLP policy with two hidden layers of 256 units,
# matching the network used in the original TD3 implementation
class CustomTD3Policy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomTD3Policy, self).__init__(*args, **kwargs,
                                              layers=[256, 256],
                                              layer_norm=False,
                                              feature_extraction="mlp")

env = gym.make('PegInHole-v0')

# Gaussian exploration noise with std 0.1, as in the original implementation
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3(CustomTD3Policy, env,
            action_noise=action_noise,
            verbose=1,
            buffer_size=int(1e6),
            batch_size=256,
            learning_starts=int(25e3),
            train_freq=1,
            gradient_steps=1)

# Train the agent
model.learn(total_timesteps=1000000)

Using this code, the original implementation consistently achieves roughly 150% of the return of the stable-baselines implementation. Does anyone have an idea of what I am missing, or why the results differ?

araffin commented 4 years ago

Hello,

You should be using the hyperparameters from the RL Zoo (which match the original paper), as mentioned in the documentation. There are several differences from your current code.

Mindgames commented 4 years ago

I mentioned this before in another issue here, and I have seen others do so as well. I believe something is broken in TD3 in recent releases of stable-baselines (>2.8.0).

In fact, the same hyperparameter setup that I previously used to take 3rd or 4th place on the OpenAI leaderboard produces totally useless results with 2.9.0 and 2.10.0.

I have not had time to investigate it more myself. I will try to look into it when I have time, but I encourage the stable-baselines authors to look into it ASAP, as TD3 is one of the more promising algorithms included.

Miffyli commented 4 years ago

@Mindgames

If you could provide example code and the versions that produce the different results, that would be a good starting point for debugging. Indeed, if there has been such a bad regression (or even a change of default behaviour), it should be fixed ASAP.

araffin commented 4 years ago

I just ran a quick sanity check on HalfCheetahBulletEnv-v0 using the rl-zoo on Google Colab:

python train.py --algo td3 --env HalfCheetahBulletEnv-v0 -params gamma:0.98 buffer_size:300000

It reaches a mean reward > 2000 in ~5e5 steps, which is the expected behavior (I got similar results using the PyTorch version, and with SAC both in SB and SB3).

Some checkpoints:

Eval num_timesteps=100000, episode_reward=662.24 +/- 35.97
Eval num_timesteps=200000, episode_reward=1257.75 +/- 20.91
Eval num_timesteps=300000, episode_reward=1575.77 +/- 6.34
Eval num_timesteps=400000, episode_reward=1974.59 +/- 14.39
Eval num_timesteps=500000, episode_reward=2314.57 +/- 26.04
Eval num_timesteps=600000, episode_reward=2604.17 +/- 26.65
Eval num_timesteps=700000, episode_reward=2613.97 +/- 31.47
Eval num_timesteps=800000, episode_reward=2746.85 +/- 16.53
Eval num_timesteps=900000, episode_reward=2681.37 +/- 10.16
Eval num_timesteps=1000000, episode_reward=2788.40 +/- 27.27

Hyperparameters:

OrderedDict([('batch_size', 100),
             ('buffer_size', 300000),
             ('env_wrapper', 'utils.wrappers.TimeFeatureWrapper'),
             ('gamma', 0.98),
             ('gradient_steps', 1000),
             ('learning_rate', 0.001),
             ('learning_starts', 10000),
             ('n_timesteps', 2000000.0),
             ('noise_std', 0.1),
             ('noise_type', 'normal'),
             ('policy', 'MlpPolicy'),
             ('policy_kwargs', 'dict(layers=[400, 300])'),
             ('train_freq', 1000)])
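
For reference, those settings correspond roughly to the direct stable-baselines call sketched below. This is only a sketch: in practice the zoo's train.py builds the model from the YAML config, the TimeFeatureWrapper import assumes you are running from an rl-baselines-zoo checkout (it lives in utils/wrappers.py there), and pybullet is needed to register the Bullet envs.

import gym
import numpy as np
import pybullet_envs  # registers HalfCheetahBulletEnv-v0 (requires pybullet)

from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise
# Assumption: running from an rl-baselines-zoo checkout so this wrapper is importable
from utils.wrappers import TimeFeatureWrapper

env = TimeFeatureWrapper(gym.make('HalfCheetahBulletEnv-v0'))

# 'noise_type': 'normal' with 'noise_std': 0.1 from the config above
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3('MlpPolicy', env,
            gamma=0.98,
            buffer_size=300000,
            batch_size=100,
            learning_rate=1e-3,
            learning_starts=10000,
            train_freq=1000,
            gradient_steps=1000,
            action_noise=action_noise,
            policy_kwargs=dict(layers=[400, 300]),
            verbose=1)
model.learn(total_timesteps=int(2e6))

Note in particular train_freq=1000 with gradient_steps=1000 (updates are done in large batches every 1000 environment steps), versus train_freq=1 and gradient_steps=1 in the code at the top of this issue.
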
jeppelangaa commented 4 years ago

@araffin The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure that I understand how to use it.

I am sorry for asking something that could be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about yet.

I have been using the hyperparameters that you stated and, even though I'm not done testing yet, it seems that the stable-baselines implementation can reach the same level as the original implementation. Thank you very much.

araffin commented 4 years ago

The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure that I understand how to use it.

I think this is related to https://github.com/araffin/rl-baselines-zoo/issues/79; I recommend reading the associated paper. You need to modify the algorithm if you remove the done signal (as is done in the original implementation). Otherwise, you should use the TimeFeatureWrapper (https://github.com/araffin/rl-baselines-zoo/issues/79).
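
For illustration, here is a rough sketch of the time-feature idea (the actual wrapper lives in utils/wrappers.py of rl-baselines-zoo and may differ in details; max_steps=1000 is an assumed default, matching the usual Bullet/MuJoCo time limit):

import gym
import numpy as np
from gym import spaces

class TimeFeatureWrapper(gym.Wrapper):
    """Append the normalized remaining time to the observation (sketch)."""

    def __init__(self, env, max_steps=1000):
        super(TimeFeatureWrapper, self).__init__(env)
        assert isinstance(env.observation_space, spaces.Box)
        # Extend the observation space with one extra dimension in [0, 1]
        low = np.concatenate([env.observation_space.low, [0.0]])
        high = np.concatenate([env.observation_space.high, [1.0]])
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)
        self._max_steps = max_steps
        self._current_step = 0

    def reset(self):
        self._current_step = 0
        return self._get_obs(self.env.reset())

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        return self._get_obs(obs), reward, done, info

    def _get_obs(self, obs):
        # 1.0 at the start of an episode, 0.0 when the time limit is reached,
        # so the policy can "see" the approaching timeout instead of relying on done.
        time_feature = 1.0 - (self._current_step / self._max_steps)
        return np.concatenate([obs, [time_feature]]).astype(np.float32)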

can reach the same level as the original implementation.

Good to hear. Please read the RL tips and tricks carefully next time ;)

I am sorry for asking something that could be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about yet.

It is mentioned in the README, on the first page of the documentation, and in the RL tips and tricks, and it has its own section in the documentation... Anyway, the repo is here: https://github.com/araffin/rl-baselines-zoo

araffin commented 4 years ago

the stable-baselines implementation can reach the same level as the original implementation

Closing this issue then.