thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

(Probably) Unfair Speed Comparison #249

Closed: araffin closed this issue 2 years ago

araffin commented 3 years ago

Hello,

You advertise tianshou as being fast and provide a comparison table in the README. However, no reference code is linked to reproduce the results.

So, I decided to create a colab notebook to have a fair comparison (same hyperparameters) between tianshou, Stable-Baselines and Stable-Baselines3.

On the two environments tested (Pendulum and CartPole), the results are quite far from those reported for Stable-Baselines... Even worse, when compared to SB2 on the Pendulum environment, Tianshou seems slower (and seems to give worse final performance).

Time to reach a mean reward of -200 for TD3 on Pendulum-v0 (1 env):

  - SB2: ~30s (vs 99s in the readme)
  - SB3: ~70s
  - Tianshou: >110s (vs 44s in the readme)

You can find the notebook here

(Please check tianshou code, I'm not 100% sure that I re-used the same hyperparameters)
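
(For reference, a minimal sketch of how such a wall-clock-to-threshold measurement can be set up on the SB3 side, assuming default TD3 hyperparameters; this is not the notebook's exact code:)

import time
import gym
from stable_baselines3 import TD3
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Stop training as soon as the mean evaluation reward reaches -200
stop_cb = StopTrainingOnRewardThreshold(reward_threshold=-200, verbose=1)
eval_cb = EvalCallback(gym.make("Pendulum-v0"), callback_on_new_best=stop_cb,
                       n_eval_episodes=10, eval_freq=1000)

model = TD3("MlpPolicy", "Pendulum-v0")
start = time.time()
model.learn(total_timesteps=100_000, callback=eval_cb)
print(f"time to reach -200: {time.time() - start:.1f}s")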

Trinkle23897 commented 3 years ago

Hello Antonin,

Thank you for reporting this issue. I really appreciate your excellent work on SB and SB3. These results were last updated in May 2020, with Ray-RLlib v0.8.5, Baselines (commit ea25b9e), PyTorch-DRL (commit 49b5ec0), SB version 2.10.0, rlpyt (commit 668290d), and tianshou (commit 57bca16). Here are the raw results from five runs:

Task: CartPole-v0
RLlib
% pg: 19.43, 17.62, 18.27, 17.38, 23.61
% dqn: 36.21, 27.79, 25.82, 30.42, 22.57
% a2c: 42.84, 55.18, 63.22, 55.45, 72.91
% ppo: 27.18, 29.08, 54.44, 39.65, 72.64
PyTorch-DRL
% dqn: 24.21, 53.96, 24.42, 28.17, 27.12
% ppo: 9.30, 21.11, 22.26, 30.91, 36.39
SB
% dqn: 45.84, 108.08, 51.31, 59.56, 202.58
% a2c: 81.00, 44.06, 56.70, 47.81, 58.23
% ppo: 20.64, 53.35, 21.50, 57.78, 20.67
tianshou
% pg: 1.65, 4.98, 14.79, 6.01, 3.03
% dqn: 5.14, 6.32, 7.62, 5.41, 5.97
% a2c: 9.54, 12.06, 8.17, 9.4, 13.8
% ppo: 30.12, 25.21, 43.53, 22.63, 37.59

Task: Pendulum-v0
RLlib
% ppo: 126.91, 105.82, 131.34, 195.46, 58.56
% ddpg: 312.93, 329.85, 307.26, 313.70, 309.75
% td3: 139.18, 158.29, 144.52, 158.24, 149.29
% sac: 102.93, 95.21, 89.95, 102.04, 96.98
Baselines
% ppo: 804.92, 832.88, 444.79, 733.01, 911.53
PyTorch-DRL
% ddpg: 42.50, 56.21, 69.02, 57.53, 69.99
% td3: 43.97, 46.44, 46.06, 91.04, 60.10
% sac: 113.88, 37.82, 40.08, 64.38, 62.84
SB
% ppo: 206.71, 284.84, 271.73, 271.81, 263.58
% ddpg: 206.58, 384.53, 135.68, 140.45, 270.36
% td3: 86.22, 142.88, 91.53, 88.77, 89.34
% sac: 251.22, 123.47, 165.39, 42.07, 42.10
rlpyt
% ddpg: 180.56, 130.14, 105.95, 106.69, 94.51
% td3: 106.37, 98.42, 136.02, 119.05, 105.12
% sac: 122.58, 169.20, 104.50, 141.96, 125.77
tianshou
% ppo: 17.64, 14.97, 20.29, 13.28, 14.70
% ddpg: 24.34, 51.15, 30.25, 36.46, 44.09
% td3: 38.22, 52.67, 42.15, 50.32, 36.85
% sac: 35.56, 35.08, 35.61, 36.83, 37.04

All of the above tests were run single-threaded on an i7-8750H + GTX 1060 (driver version 440.64), with PyTorch 1.4.0, TensorFlow 1.14.0, CUDA 10.0, Ubuntu 18.04, and Python 3.6.9.

I did not align each of the repos to exactly the same set of hyperparameters, for a few reasons:

  1. At first, I indeed wanted to align the hyperparameters, but when I tried to apply my own hyperparameters to other codebases, they all performed very badly (you can try this with the other repos instead of tianshou). The hyperparameters depend on each implementation, for example, whether the MSE loss is multiplied by 0.5. So I assumed the provided examples from other codebases were their best results, and I used them to generate the above results.
  2. One of my motivations for building tianshou is that we can accelerate environment simulation by running multiple environments in parallel. This batch-style environment simulation with ALL RL algorithms isn't implemented by any of the existing codebases, so I wrote the code to support this feature natively (see the sketch after this list). In the test, I could not re-implement the parallel env simulation in each of the codebases (this is a huge amount of work!), but I also wanted to demonstrate the effectiveness of this method. (I think it would be more reasonable to add a column with a 1-env configuration for tianshou, and I guess it may be slower than the others, what do you think?)
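
A minimal sketch of that batched-env interface, assuming the tianshou/gym APIs from around that time (SubprocVectorEnv with the old 4-tuple step):

import gym
import numpy as np
from tianshou.env import SubprocVectorEnv

num_envs = 8
# Each env runs in its own worker process; reset/step are batched over all of them
train_envs = SubprocVectorEnv([lambda: gym.make("Pendulum-v0") for _ in range(num_envs)])
sample_env = gym.make("Pendulum-v0")  # only used here to sample random actions

obs = train_envs.reset()              # batched observations, shape (num_envs, obs_dim)
acts = np.stack([sample_env.action_space.sample() for _ in range(num_envs)])
obs, rew, done, info = train_envs.step(acts)   # all envs step in parallel
train_envs.close()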

However, as you can see, we have not updated these results in recent months. Since there is no free lunch, supporting other user-friendly features such as async vecenv and multi-level dict states slows down the existing functionality. I chose the latter, but tried to find a balance between the various feature requests, performance concerns, and user experience (such as #189).

Best, Jiayi

araffin commented 3 years ago

Thank you for reporting this issue. I really appreciate your excellent work on SB and SB3.

Thanks =)

Here are the raw results from five runs:

You are still not providing any code to reproduce those results (a colab notebook would be best, as in my first message ;)).

I did not align each of the repos to exactly the same set of hyperparameters, for a few reasons:

well, you should compare apples to apples*, otherwise showing a comparison table is at best misleading, at worst dishonest.

This is even more true as you are advertising tianshou as "superfast".

*at least in terms of the number of envs used for training, which is in my opinion the main factor in training speed.

At first, I indeed wanted to align the hyperparameters, but when I tried to apply my own hyperparameters to other codebases, they all performed very badly (you can try this with the other repos instead of tianshou).

I would agree with that for algorithms that require special care (A2C, PPO) but not for off-policy ones (SAC, TD3, DDPG).

One of my motivations for building tianshou is that we can accelerate environment simulation by running multiple environments in parallel.

I don't dispute that feature. The fact that tianshou supports dict and multiprocessing for all algorithms is a cool thing* and must be highlighted!

But instead of showing a features table, you decided to show a speed comparison table which apparently does not compare the same things. So my main concern is the way things are presented.

*in fact, this is in our roadmap too, see issues https://github.com/DLR-RM/stable-baselines3/issues/179 and https://github.com/DLR-RM/stable-baselines3/issues/216

In the test, I could not re-implement the parallel env simulation in each of the codebases (this is a huge amount of work!)

Well, for some codebases at least (baselines, stable-baselines), it is easy to implement.

And I know this is work (I'm currently checking SB3 performance vs reference implementations), but this is science: you must be fair if you compare things.

Also, if you provide code to reproduce the results, other maintainers will come and help you do that fair comparison ;).

I think it would be more reasonable to add a column with a 1-env configuration for tianshou, and I guess it may be slower than the others, what do you think?

well, if all the other codebases were tested with one env, then you should definitely add a column with one env for tianshou.

Trinkle23897 commented 3 years ago

You are still not providing any code to reproduce those results (a colab notebook would be best, as in my first message ;)). Also, if you provide code to reproduce the results, other maintainers will come and help you do that fair comparison ;).

The code is under the test/discrete/ and test/continuous/ dirs, and you can check out that commit to run it.

well, you should compare apples to apples, otherwise showing a comparison table is at best misleading, at worst dishonest. This is even more true as you are advertising tianshou as "superfast". at least in terms of the number of envs used for training, which is in my opinion the main factor in training speed. But instead of showing a features table, you decided to show a speed comparison table which apparently does not compare the same things. So my main concern is the way things are presented. And I know this is work (I'm currently checking SB3 performance vs reference implementations), but this is science: you must be fair if you compare things. well, if all the other codebases were tested with one env, then you should definitely add a column with one env for tianshou.

I agree with this point. But at that time I thought tianshou used more computation within a shorter time, while the others used less computation over a longer time; is this unfair to tianshou to some extent...? If we make sure all of these use exactly the same amount of computational resources, the result will be much worse.


Let's discuss the solution here:

  1. remove "superfast" from the description. As I mentioned before, speed is not the first concern of this repo;
  2. modify the table in the README and add a column using a 1-env config with tianshou to make it fair.

I really appreciate your insightful discussion!

araffin commented 3 years ago

The code is under the test/discrete/ and test/continuous/ dirs, and you can check out that commit to run it.

I meant the code for benchmarking the other libraries.

But at that time I thought tianshou used more computation within a shorter time, while the others used less computation over a longer time; is this unfair to tianshou to some extent...?

I'm not sure I get your point. Could you elaborate?

Let's discuss the solution here:

sounds good ;)

and probably add a note explaining the 2nd column for tianshou (if you keep it), specifying the number of parallel envs used.

Trinkle23897 commented 3 years ago

I meant the code for benchmarking the other libraries.

For SB, I used your rl-baselines-zoo as recommended in the readme of stable-baselines, for example,

python3 train.py --algo ddpg --env Pendulum-v0 --seed 4 --eval-episodes 10

I added a timer in train.py. (p.s., it should be --eval-episodes 100 to make a fair comparison with tianshou, but to err on the side of being unfair to tianshou, I used --eval-episodes 10 at that time)

I'm not sure I get your point. Could you elaborate?

Effectiveness: longer time, lower computation < longer time, higher computation / shorter time, lower computation < shorter time, higher computation. I directly compared the first item and the last item at that time.

araffin commented 3 years ago

For SB, I used your rl-baselines-zoo as recommended in the readme of stable-baselines, for example,

oh, I see... well, this code does something completely different from what is mentioned in the readme table: here you train one algorithm for a fixed number of timesteps (this is what the zoo does, there is no early stopping by default, unless that's what you meant by "I added a timer"). And the fixed number of steps is usually an upper limit; the goal was to have working hyperparameters, not the most sample-efficient / fastest ones.

And here, those are old defaults (2e5 timesteps, whereas you can solve this env with DDPG in 2e4 steps by setting the learning rate to 1e-3); that's why I was advocating using the same hyperparams, at least for the off-policy algorithms.
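
(A minimal sketch of that kind of run, assuming SB3's DDPG with otherwise default settings:)

from stable_baselines3 import DDPG

# With a larger learning rate, Pendulum-v0 can be solved in ~2e4 steps
# instead of the zoo's old 2e5-step default budget
model = DDPG("MlpPolicy", "Pendulum-v0", learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=20_000)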

I added a timer in train.py. (p.s., it should be --eval-episodes 100 to make a fair comparison with tianshou, but to err on the side of being unfair to tianshou, I used --eval-episodes 10 at that time)

yes and no, I would separate the evaluation time from the training time (or again, you need to use the same number of test environments when comparing*, otherwise you compare different things). And it seems you also used the default evaluation settings, so an evaluation every 10k steps (that could be changed too).

*parallel test envs are currently not implemented in SB3 though... but I can write a quick and dirty version which runs at least n_eval_episodes
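
(A quick and dirty sketch of what such a parallel evaluation loop could look like; evaluate_parallel is a hypothetical helper, not an SB3 function, and it relies on VecEnv auto-reset:)

import numpy as np
from stable_baselines3.common.env_util import make_vec_env

def evaluate_parallel(model, env_id="Pendulum-v0", n_eval_episodes=10, n_envs=4):
    # Run the vectorized env until at least n_eval_episodes have finished
    eval_env = make_vec_env(env_id, n_envs=n_envs)
    episode_rewards, current = [], np.zeros(n_envs)
    obs = eval_env.reset()
    while len(episode_rewards) < n_eval_episodes:
        actions, _ = model.predict(obs, deterministic=True)
        obs, rewards, dones, infos = eval_env.step(actions)
        current += rewards
        for i in np.where(dones)[0]:
            episode_rewards.append(current[i])  # VecEnv auto-resets env i
            current[i] = 0.0
    eval_env.close()
    return float(np.mean(episode_rewards))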

Effectiveness: longer time, lower computation < longer time, higher computation / shorter time, lower computation < shorter time, higher computation. I directly compared the first item and the last item at that time.

I would agree if we were comparing different algorithms and different hyperparameters.

For instance, with off-policy algorithms, you can do more gradient steps than steps taken in the environment, and there you have a tradeoff between computation and sample efficiency.
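
(For example, SB3's off-policy classes expose this tradeoff through the train_freq / gradient_steps arguments; a minimal sketch, with default hyperparameters otherwise:)

from stable_baselines3 import SAC

# More gradient updates per env step: better sample efficiency, more compute
sample_efficient = SAC("MlpPolicy", "Pendulum-v0", train_freq=64, gradient_steps=128)

# Fewer updates per env step: cheaper per step, but needs more env samples
compute_light = SAC("MlpPolicy", "Pendulum-v0", train_freq=64, gradient_steps=32)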

The same goes when using more than one environment for training (we have a notebook about that tradeoff): if you use more environments in parallel, your training will usually be faster, but at a cost in sample efficiency.

But if you run two implementations of the same algorithm with the same number of training environments, you should be able to have a fair comparison ;)

Trinkle23897 commented 3 years ago

this is what the zoo does, there is no early stopping by default, unless that's what you meant by "I added a timer"

I did add an extra callback in train.py:

@@ -293,8 +310,10 @@ if __name__ == '__main__':
         if args.verbose > 0:
             print("Creating test environment")

-        save_vec_normalize = SaveVecNormalizeCallback(save_freq=1, save_path=params_path)
-        eval_callback = EvalCallback(create_env(1, eval_env=True), callback_on_new_best=save_vec_normalize,
+        # save_vec_normalize = SaveVecNormalizeCallback(save_freq=1, save_path=params_path)
+        callback_on_best = StopTrainingOnRewardThreshold(
+            reward_threshold=195, verbose=1)
+        eval_callback = EvalCallback(create_env(1, eval_env=True), callback_on_new_best=callback_on_best,
                                      best_model_save_path=save_path, n_eval_episodes=args.eval_episodes,
                                      log_path=save_path, eval_freq=args.eval_freq)
         callbacks.append(eval_callback)

and made the test.


remove "superfast" from the description. As I mentioned before, speed is not the first concern of this repo;

This is done.

For the rest, I'll do a comprehensive test during the winter break because I'm currently busy with my final exams and course projects :( I have also come up with an idea that can support async venv without any overhead, but it requires a large amount of code refactoring. I plan to implement it this winter.

araffin commented 3 years ago

Linking issues related to that comparison:

Trinkle23897 commented 2 years ago

@araffin please have a look at #406, I've changed the table. Sorry for the delay.

araffin commented 2 years ago

@araffin please have a look at #406, I've changed the table. Sorry for the delay.

LGTM, thanks =)

Trinkle23897 commented 2 years ago

BTW, could you please fix SB2's CI pipeline so the badge looks nicer?

araffin commented 2 years ago

BTW, could you please fix SB2's CI pipeline so the badge looks nicer?

unfortunately, I am facing issues with Codacy... but as the code won't change, you can fix the code coverage to 86% (current coverage):

[coverage badge]

araffin commented 2 years ago

As a small remark, the true number of algorithms implemented is 9 for SB3 (you have to include TQC / QR-DQN from the contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib)

Trinkle23897 commented 2 years ago

Hi @araffin, please have a look at #418. I can help change any description if you want.

araffin commented 2 years ago

Hi @araffin, please have a look at #418. I can help change any description if you want.

LGTM, thanks ;)