jeppelangaa closed this issue 4 years ago
Hello,
You should be using the hyperparameters from the RL Zoo (which match the original paper), as mentioned in the documentation. There are several differences from your current code:
PegInHole-v0 has a time limit, so you should either add a time feature or remove the done signals that are not real terminations (see https://github.com/hill-a/stable-baselines/pull/120#issuecomment-550966845).
policy: 'MlpPolicy'
gamma: 0.99
buffer_size: 1000000
noise_type: 'normal'
noise_std: 0.1
learning_starts: 10000
batch_size: 100
learning_rate: !!float 1e-3
train_freq: 1000
gradient_steps: 1000
policy_kwargs: "dict(layers=[400, 300])"
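For illustration, here is a rough sketch of how a config like the one above might be split into constructor arguments. The helper name split_td3_hyperparams is hypothetical (it is not part of stable-baselines or the zoo); it only separates the exploration-noise settings from the rest and evaluates the policy_kwargs string.

```python
# Hypothetical helper, not part of stable-baselines or the RL Zoo.
HYPERPARAMS = {
    "policy": "MlpPolicy",
    "gamma": 0.99,
    "buffer_size": 1000000,
    "noise_type": "normal",
    "noise_std": 0.1,
    "learning_starts": 10000,
    "batch_size": 100,
    "learning_rate": 1e-3,
    "train_freq": 1000,
    "gradient_steps": 1000,
    "policy_kwargs": "dict(layers=[400, 300])",
}


def split_td3_hyperparams(hyperparams):
    """Return (model_kwargs, noise_spec) from a zoo-style config dict."""
    kwargs = dict(hyperparams)  # copy so the input is left untouched
    # The noise settings describe an action-noise object, not TD3 arguments.
    noise_spec = {
        "type": kwargs.pop("noise_type", None),
        "std": kwargs.pop("noise_std", None),
    }
    # policy_kwargs is stored as a string in the config; evaluate it with
    # `dict` as the only name made available to the expression.
    if isinstance(kwargs.get("policy_kwargs"), str):
        kwargs["policy_kwargs"] = eval(
            kwargs["policy_kwargs"], {"__builtins__": {}}, {"dict": dict}
        )
    return kwargs, noise_spec


model_kwargs, noise_spec = split_td3_hyperparams(HYPERPARAMS)
print(model_kwargs["policy_kwargs"])
print(noise_spec)
```

With stable-baselines installed, model_kwargs should then be usable as TD3(env=env, action_noise=..., **model_kwargs), with the action noise built from noise_spec (for example a NormalActionNoise whose sigma is noise_std times a vector of ones).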
I mentioned this before in another issue here, and I have seen others report it too. I believe something is broken in TD3 in recent releases of stable-baselines (>2.8.0).
In fact, the same hyperparameter setup that previously got me 3rd or 4th place on the OpenAI leaderboard gives totally useless results with 2.9.0 and 2.10.0.
I have not had time to investigate it further myself. I will try to look into it when I have time, but I encourage the stable-baselines authors to look into it asap, as TD3 is one of the more promising algorithms included.
@Mindgames
If you could provide example code and the versions that produce the different results, this would be a good start for debugging. Indeed, if there has been such a bad regression (or even a change of default behaviour), it should be fixed asap.
I just ran a quick sanity check on HalfCheetahBulletEnv-v0 using the rl-zoo on Google Colab:
python train.py --algo td3 --env HalfCheetahBulletEnv-v0 -params gamma:0.98 buffer_size:300000
It reaches a mean reward > 2000 in ~5e5 steps, which is the expected behavior (I got similar results using the PyTorch version and SAC, both in SB and SB3).
Some checkpoints:
Eval num_timesteps=100000, episode_reward=662.24 +/- 35.97
Eval num_timesteps=200000, episode_reward=1257.75 +/- 20.91
Eval num_timesteps=300000, episode_reward=1575.77 +/- 6.34
Eval num_timesteps=400000, episode_reward=1974.59 +/- 14.39
Eval num_timesteps=500000, episode_reward=2314.57 +/- 26.04
Eval num_timesteps=600000, episode_reward=2604.17 +/- 26.65
Eval num_timesteps=700000, episode_reward=2613.97 +/- 31.47
Eval num_timesteps=800000, episode_reward=2746.85 +/- 16.53
Eval num_timesteps=900000, episode_reward=2681.37 +/- 10.16
Eval num_timesteps=1000000, episode_reward=2788.40 +/- 27.27
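As a quick check of the "> 2000 in ~5e5 steps" statement, the checkpoints above can be parsed with a few lines of Python. This is only a sketch: the regular expression assumes the exact "Eval num_timesteps=..., episode_reward=... +/- ..." format shown here, and first_step_above is a made-up helper name.

```python
import re

LOG = """\
Eval num_timesteps=100000, episode_reward=662.24 +/- 35.97
Eval num_timesteps=200000, episode_reward=1257.75 +/- 20.91
Eval num_timesteps=300000, episode_reward=1575.77 +/- 6.34
Eval num_timesteps=400000, episode_reward=1974.59 +/- 14.39
Eval num_timesteps=500000, episode_reward=2314.57 +/- 26.04
Eval num_timesteps=600000, episode_reward=2604.17 +/- 26.65
Eval num_timesteps=700000, episode_reward=2613.97 +/- 31.47
Eval num_timesteps=800000, episode_reward=2746.85 +/- 16.53
Eval num_timesteps=900000, episode_reward=2681.37 +/- 10.16
Eval num_timesteps=1000000, episode_reward=2788.40 +/- 27.27
"""

# Captures: timesteps, mean episode reward, standard deviation.
PATTERN = re.compile(
    r"Eval num_timesteps=(\d+), episode_reward=(-?\d+\.\d+) \+/- (\d+\.\d+)"
)


def first_step_above(log, threshold):
    """Return the first checkpoint whose mean reward exceeds threshold."""
    for steps, mean, _std in PATTERN.findall(log):
        if float(mean) > threshold:
            return int(steps)
    return None  # threshold never reached in this log


print(first_step_above(LOG, 2000.0))  # 500000
```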
Hyperparameters:
OrderedDict([('batch_size', 100),
('buffer_size', 300000),
('env_wrapper', 'utils.wrappers.TimeFeatureWrapper'),
('gamma', 0.98),
('gradient_steps', 1000),
('learning_rate', 0.001),
('learning_starts', 10000),
('n_timesteps', 2000000.0),
('noise_std', 0.1),
('noise_type', 'normal'),
('policy', 'MlpPolicy'),
('policy_kwargs', 'dict(layers=[400, 300])'),
('train_freq', 1000)])
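Note the env_wrapper entry above. As a rough idea of what a time feature does, here is a simplified sketch that appends the remaining fraction of the episode to each observation. It is not the zoo's TimeFeatureWrapper code; ToyEnv and TimeFeature are hypothetical stand-ins written without gym so the snippet runs on its own.

```python
class ToyEnv:
    """Hypothetical stand-in for a gym-style env with a fixed time limit."""

    def __init__(self, max_steps=4):
        self.max_steps = max_steps
        self._t = 0

    def reset(self):
        self._t = 0
        return [0.0]

    def step(self, action):
        self._t += 1
        done = self._t >= self.max_steps  # the episode ends only by timeout
        return [float(self._t)], 1.0, done, {}


class TimeFeature:
    """Append the remaining fraction of the episode to each observation."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self._t = 0

    def _augment(self, obs):
        remaining = 1.0 - self._t / self.max_steps
        return list(obs) + [max(remaining, 0.0)]

    def reset(self):
        self._t = 0
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._t += 1
        return self._augment(obs), reward, done, info


env = TimeFeature(ToyEnv(max_steps=4), max_steps=4)
obs = env.reset()
trace = [obs[-1]]
done = False
while not done:
    obs, reward, done, info = env.step(0)
    trace.append(obs[-1])
print(trace)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

With the extra feature, the agent can see that the episode is about to end, so a timeout no longer looks like an unexplained termination.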
@araffin The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure I understand how to use it.
I am sorry for asking something that could be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about yet.
I have been using the hyperparameters that you stated, and although I'm not done testing yet, it seems that the stable baselines implementation can reach the same level as the original implementation. Thank you very much.
> The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure I understand how to use it.
I think this is related to https://github.com/araffin/rl-baselines-zoo/issues/79; I recommend reading the associated paper. If you remove the done signal, you need to modify the algorithm (as is done in the original implementation). Otherwise, you should use the TimeFeatureWrapper (https://github.com/araffin/rl-baselines-zoo/issues/79).
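To make the distinction concrete, here is a minimal sketch of a TD3-style bootstrap target, r + gamma * (1 - done) * min(Q1, Q2), leaving out target policy smoothing. The function name td3_target and the timeout flag are illustrative assumptions, not stable-baselines API. It shows why a "done" caused only by the time limit should not cut the bootstrap term.

```python
def td3_target(reward, next_q1, next_q2, done, timeout, gamma=0.99):
    """Clipped double-Q target for a single transition.

    `done` is the flag returned by the environment; `timeout` is True when
    the episode ended only because the time limit was reached.
    """
    # Only real terminations should stop bootstrapping from the next state.
    terminal = done and not timeout
    next_q = min(next_q1, next_q2)  # clipped double-Q: take the smaller critic
    return reward + gamma * (0.0 if terminal else 1.0) * next_q


# Same transition, once as a real termination and once as a timeout.
real_end = td3_target(1.0, 10.0, 12.0, done=True, timeout=False)
time_limit = td3_target(1.0, 10.0, 12.0, done=True, timeout=True)
print(real_end, time_limit)
```

In the first case the target is just the reward; in the second the discounted next-state value is kept, which is what removing the "fake" done signal amounts to.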
> can reach the same level as the original implementation.
Good to hear. Please read the RL tips and tricks carefully next time ;)
> I am sorry for asking something that could be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about yet.
It is mentioned in the README, on the first page of the documentation, in the RL tips and tricks, and it has its own section in the documentation... Anyway, the repo is here: https://github.com/araffin/rl-baselines-zoo
> the stable baselines implementation can reach the same level as the original implementation
Closing this issue then.
I wanted to use the stable baselines implementation of TD3 in order to be able to compare the algorithm to other reinforcement learning algorithms more easily.
I have compared the original implementation of TD3 from sfujim to the one from stable baselines. The one from stable baselines is faster in terms of computation time, but it cannot achieve as good results as the original implementation. I have created a custom policy with two hidden layers of 256 nodes and set the hyperparameters according to the original implementation as seen below:
Using this code, the original implementation consistently achieves a performance of approximately 150% of the stable baselines implementation. Does anyone have any clues about some things that I am missing? Or maybe some ideas on why the results differ?