thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

Plans of releasing mujoco benchmark with ddpg/sac/td3 on Tianshou #274

Closed ChenDRAG closed 3 years ago

ChenDRAG commented 3 years ago

Purpose

The purpose of this issue (discussion) is to introduce a series of PRs planned for the near future, aimed at releasing a benchmark (SAC, TD3, DDPG) on MuJoCo environments. Some features of the Tianshou platform will be enhanced along the way.

Introduction

At the time this issue is proposed, Tianshou has attracted 2.4k stars on GitHub and has become a very popular deep RL library based purely on PyTorch (in contrast with OpenAI Baselines, rllab, etc.), thanks to the contributions of @Trinkle23897, @duburcqa, @youkaichao, etc. However, as the user base grows day by day, some problems have started to surface. One critical problem is that although Tianshou is a fast, structured, flexible library that officially supports many classic algorithms, it has done a relatively poor job of benchmarking the algorithms it supports. Examples and demonstrations are mostly tested on toy Gym environments, and we have not yet provided a detailed comparison and analysis against the classic papers for the officially supported algorithms. This might make users worry about the correctness and efficiency of the algorithms, and it makes it harder for researchers using Tianshou to reproduce the results of classic papers, because trustworthy hyperparameters (in other words, baselines) are lacking.

Tianshou hopes to provide users with a lightweight and efficient DRL platform and to reduce the burden on RL researchers as much as possible. Even users who are beginners and not yet familiar with DRL algorithms or baselines should be able to design their own algorithm in a minimal number of lines by inheriting and using the official data and algorithm structures, understand the source code, and compare their ideas with standard algorithms easily. To achieve this, one thing we have to do is provide a detailed benchmark for widely used algorithms and environments.

This is what I have been trying to do, and the first step has been taken. Using Tianshou, I have managed to create a state-of-the-art benchmark for three algorithms on the 9 most widely used of MuJoCo's 14 environments.

ddpg

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper (DDPG) | TD3 paper (our DDPG) |
| --- | --- | --- | --- | --- |
| Ant | 990.4±4.3 | ~840 | 1005.3 | 888.8 |
| HalfCheetah | 11718.7±465.6 | ~11000 | 3305.6 | 8577.3 |
| Hopper | 2197.0±971.6 | ~1800 | 2020.5 | 1860.0 |
| Walker2d | 1400.6±905.0 | ~1950 | 1843.6 | 3098.1 |
| Swimmer | 144.1±6.5 | ~137 | N | N |
| Humanoid | 177.3±77.6 | N | N | N |
| Reacher | -3.3±0.3 | N | -6.51 | -4.01 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0 | 1000.0 |
| InvertedDoublePendulum | 8364.3±2778.9 | N | 9355.5 | 8370.0 |

td3

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper |
| --- | --- | --- | --- |
| Ant | 5116.4±799.9 | ~3800 | 4372.4±1000.3 |
| HalfCheetah | 10201.2±772.8 | ~9750 | 9637.0±859.1 |
| Hopper | 3472.2±116.8 | ~2860 | 3564.1±114.7 |
| Walker2d | 3982.4±274.5 | ~4000 | 4682.8±539.6 |
| Swimmer | 104.2±34.2 | ~78 | N |
| Humanoid | 5189.5±178.5 | N | N |
| Reacher | -2.7±0.2 | N | -3.6±0.6 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0±0.0 |
| InvertedDoublePendulum | 9349.2±14.3 | N | 9337.5±15.0 |

sac

| Environment | Tianshou | Spinning Up (PyTorch) | SAC paper |
| --- | --- | --- | --- |
| Ant | 5850.2±475.7 | ~3980 | ~3720 |
| HalfCheetah | 12138.8±1049.3 | ~11520 | ~10400 |
| Hopper | 3542.2±51.5 | ~3150 | ~3370 |
| Walker2d | 5007.0±251.5 | ~4250 | ~3740 |
| Swimmer | 44.4±0.5 | ~41.7 | N |
| Humanoid | 5488.5±81.2 | N | ~5200 |
| Reacher | -2.6±0.2 | N | N |
| InvertedPendulum | 1000.0±0.0 | N | N |
| InvertedDoublePendulum | 9359.5±0.4 | N | N |

* Reward metric: Each table value is the maximum average return over 10 trials (different seeds) ± one standard deviation over trials. Each trial is averaged over another 10 test seeds. Only data from the first 1M steps is considered. The shaded region on the graph also represents one standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.)

** ~ means the number is approximated from the graph because exact numbers are not provided in the paper. N means that no graph is provided.

*** We used the latest version of all MuJoCo environments in Gym (0.17.3), which is often not the case in other papers. Please check the original papers for details. (Results from different versions are usually similar, though.)

**** We did not compare against OpenAI Baselines, because for now I think its benchmark is corrupted(?), and I have not been able to find the information I need. However, the Spinning Up docs state that "Spinning Up implementations of DDPG, TD3, and SAC are roughly at-parity with the best-reported results for these algorithms", so I guess the lack of a comparison with OpenAI Baselines is okay.
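One possible reading of the reward metric in the first footnote can be sketched as follows: average the test-return curves over trials, pick the evaluation point where that average is highest, and report the standard deviation over trials at that same point. The function name `max_average_return`, the use of the population standard deviation, and the sample numbers are all assumptions made for illustration; they are not taken from the benchmark scripts.

```python
from statistics import mean, pstdev


def max_average_return(curves):
    """Return (best mean, std over trials at that point, index of that point).

    `curves` holds one list per trial (seed); entry j of each list is the
    test return at evaluation point j, already averaged over test episodes.
    """
    # Transpose: one tuple of per-trial returns for each evaluation point.
    per_eval = list(zip(*curves))
    means = [mean(col) for col in per_eval]
    # Evaluation point with the highest trial-averaged return.
    best = max(range(len(means)), key=means.__getitem__)
    # Assumption: population standard deviation over trials.
    return means[best], pstdev(per_eval[best]), best


# Hypothetical data: 3 trials, 4 evaluation points each.
curves = [
    [0.0, 10.0, 20.0, 15.0],
    [0.0, 14.0, 26.0, 15.0],
    [0.0, 12.0, 20.0, 15.0],
]
best_mean, best_std, best_idx = max_average_return(curves)
print(f"{best_mean:.1f}±{best_std:.1f} at evaluation point {best_idx}")
```

If the benchmark instead takes each trial's own maximum before averaging, the numbers would differ, so treat this only as one interpretation of the footnote.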

figure

I only show one figure here as an example; all other figures for the Tianshou MuJoCo benchmark can be found here.

Achieving these results was not easy: it required not only hyperparameter tuning but also changes to some features of the Tianshou platform, most of which have already been raised in different issues by different users. For example:

There are also other problems that the issues haven't mentioned or that I haven't noticed. For instance:

All the problems above will be addressed to a certain extent in the course of releasing the benchmark. The scripts that achieve this benchmark are hosted on my fork of Tianshou and can be found here. However, the fork cannot be merged directly: it only serves to demonstrate our idea, so it is not well organized (it lacks consistency, docs, comments, tests, etc.). Another reason is that this will be a big merge for Tianshou, and we want to enhance Tianshou without causing too much disruption for our users. I have therefore made a plan and hope to merge all the code in 6 commits in total over the next few weeks. All of these commits are aimed at eventually releasing the benchmark above.

Plans

Here I briefly introduce what these 6 commits try to do.

  1. In net utils, enhance Net so that it supports any type of MLP.
  2. A minor fix to batch, and a new ReplayBuffer class called CachedReplayBuffer.
  3. Refactor the collector to support both ReplayBuffer and CachedReplayBuffer.
  4. Refactor the trainer to add a user-defined logger.
  5. Some small fixes in tianshou/policy to make policies easier to use, and add some standard tricks to them.
  6. Release the MuJoCo benchmark (source code, data, graphs, detailed comparison, analysis of hyperparameters, etc.) for the 3 algorithms.

Future work

  1. Remove warnings and implementations for originally supported but now unsupported methods.
  2. Add support (benchmarked in the same way) for other algorithms (VPG, PPO, TRPO, etc.).
  3. Speed analysis, and a set of hyperparameters that allow training in parallel with Tianshou to speed up training.
  4. Consider discrete-action environments like Atari (maybe support Rainbow in Tianshou).
  5. A tutorial on how to tune hyperparameters for a given RL problem.
  6. ......
Trinkle23897 commented 3 years ago

In short, the major change is to move cache_buffer (currently handled in Collector) down to the buffer level, to support collecting exactly n_step steps and to make the collector cleaner.

I really love the method proposed by @ChenDRAG. He organizes the CachedReplayBuffer as:

| main_buffer | cache_buffer_1 | ... | cache_buffer_n |
|                 a whole batch                       |

where n is the number of envs. All of this data is stored in a single (large) batch, so we can greatly simplify the original collector's code.
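The layout above can be illustrated with a minimal sketch. It assumes that each env writes its in-progress episode into its own cache region and that the episode is copied into the main region once it ends. The class `CachedBufferSketch`, its methods, and the flush-on-done policy are hypothetical names and choices for illustration only, not the actual CachedReplayBuffer API; a real implementation would store arrays rather than a Python list.

```python
class CachedBufferSketch:
    """Flat storage: | main | cache_1 | ... | cache_n |, with n == number of envs."""

    def __init__(self, main_size, n_envs, cache_size):
        self.main_size = main_size
        self.cache_size = cache_size
        # One flat storage holds the main region and every per-env cache.
        self.data = [None] * (main_size + n_envs * cache_size)
        self.main_ptr = 0              # next write position in the main region
        self.main_len = 0              # number of valid items in the main region
        self.cache_len = [0] * n_envs  # steps currently held in each cache

    def cache_offset(self, env_id):
        """Index in the flat storage where the cache of `env_id` starts."""
        return self.main_size + env_id * self.cache_size

    def add(self, env_id, transition, done):
        """Append one step to the env's cache; flush the cache when the episode ends."""
        if self.cache_len[env_id] >= self.cache_size:
            raise ValueError("episode is longer than cache_size")
        start = self.cache_offset(env_id)
        self.data[start + self.cache_len[env_id]] = transition
        self.cache_len[env_id] += 1
        if done:
            self._flush(env_id)

    def _flush(self, env_id):
        """Copy a finished episode into the main region (used as a ring buffer)."""
        start = self.cache_offset(env_id)
        for i in range(self.cache_len[env_id]):
            self.data[self.main_ptr] = self.data[start + i]
            self.main_ptr = (self.main_ptr + 1) % self.main_size
            self.main_len = min(self.main_len + 1, self.main_size)
        self.cache_len[env_id] = 0

    def main(self):
        """Valid contents of the main region."""
        return self.data[:self.main_len]


buf = CachedBufferSketch(main_size=8, n_envs=2, cache_size=4)
buf.add(0, "a0", done=False)
buf.add(1, "b0", done=False)
buf.add(1, "b1", done=True)   # env 1 finishes first, so its episode is flushed first
buf.add(0, "a1", done=True)
print(buf.main())
```

Because the collector only calls `add` per env and per step, the bookkeeping for unfinished episodes lives entirely inside the buffer, which is the simplification described above.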

We also plan to move the async collect method into a separate AsyncCollector (inheriting from the simplified base collector). Most of the time users use the sync method for experiments, but the current async code in the collector adds a lot of overhead. Splitting these functions will make things cleaner and easier for users to handle.

Trinkle23897 commented 3 years ago

TODO list after #280:

ChenDRAG commented 3 years ago

The first 5 of the 6 commits discussed above are finished. I have reproduced the MuJoCo benchmark for some algorithms in some environments. Some results are better, some are worse. Based on the results I observed, we can still use the benchmark graph provided above. Perhaps the dev branch is ready to be merged into master?

ChenDRAG commented 3 years ago

Example results: [three images not preserved]

Trinkle23897 commented 3 years ago

Could you please provide the new numerical result here? (based on what you have experimented)