Plans for releasing a MuJoCo benchmark with DDPG/SAC/TD3 on Tianshou

ChenDRAG commented 3 years ago

Purpose

The purpose of this issue (discussion) is to introduce a series of PRs, planned for the near future, aimed at releasing a benchmark (SAC, TD3, DDPG) on MuJoCo environments. Some features of the Tianshou platform will be enhanced along the way.

Introduction

At the time of writing, Tianshou has attracted 2.4k stars on GitHub and has become a very popular deep RL library based purely on PyTorch (in contrast with OpenAI Baselines, rllab, etc.), thanks to the contributions of @Trinkle23897, @duburcqa, @youkaichao, and others. However, as the user base grows day by day, some problems have started to surface. One critical problem is that although Tianshou is a fast, structured, flexible library that officially supports many classic algorithms, it has done a relatively poor job of benchmarking the algorithms it supports. Examples and demonstrations are mostly tested on gym's toy environments, and we have not yet provided a detailed comparison and analysis against classic papers for the officially supported algorithms. This might make users worry about the correctness and efficiency of the algorithms, and it makes it harder for researchers using Tianshou to reproduce the results of classic papers, because trustworthy hyperparameters (in other words, baselines) are missing.

Tianshou hopes to provide users with a lightweight and efficient DRL platform and to reduce the burden on RL researchers as much as possible. Even users who are beginners and not yet familiar with DRL algorithms or baselines should be able to design their own algorithm in a few lines of code by inheriting from and using the official data and algorithm structures, to understand the source code, and to compare their ideas with standard algorithms easily. To achieve this, one thing we have to do is provide a detailed benchmark for widely used algorithms and environments.

This is what I have been trying to do, and the first step has been taken. Using Tianshou, I have managed to create a state-of-the-art benchmark for three algorithms on the 9 most widely used of MuJoCo's 14 environments.

ddpg

Environment	Tianshou	Spinning Up (PyTorch)	TD3 paper (DDPG)	TD3 paper (our DDPG)
Ant	990.4±4.3	~840	1005.3	888.8
HalfCheetah	11718.7±465.6	~11000	3305.6	8577.3
Hopper	2197.0±971.6	~1800	2020.5	1860.0
Walker2d	1400.6±905.0	~1950	1843.6	3098.1
Swimmer	144.1±6.5	~137	N	N
Humanoid	177.3±77.6	N	N	N
Reacher	-3.3±0.3	N	-6.51	-4.01
InvertedPendulum	1000.0±0.0	N	1000.0	1000.0
InvertedDoublePendulum	8364.3±2778.9	N	9355.5	8370.0

td3

Environment	Tianshou	Spinning Up (PyTorch)	TD3 paper
Ant	5116.4±799.9	~3800	4372.4±1000.3
HalfCheetah	10201.2±772.8	~9750	9637.0±859.1
Hopper	3472.2±116.8	~2860	3564.1±114.7
Walker2d	3982.4±274.5	~4000	4682.8±539.6
Swimmer	104.2±34.2	~78	N
Humanoid	5189.5±178.5	N	N
Reacher	-2.7±0.2	N	-3.6±0.6
InvertedPendulum	1000.0±0.0	N	1000.0±0.0
InvertedDoublePendulum	9349.2±14.3	N	9337.5±15.0

sac

Environment	Tianshou	Spinning Up (PyTorch)	SAC paper
Ant	5850.2±475.7	~3980	~3720
HalfCheetah	12138.8±1049.3	~11520	~10400
Hopper	3542.2±51.5	~3150	~3370
Walker2d	5007.0±251.5	~4250	~3740
Swimmer	44.4±0.5	~41.7	N
Humanoid	5488.5±81.2	N	~5200
Reacher	-2.6±0.2	N	N
InvertedPendulum	1000.0±0.0	N	N
InvertedDoublePendulum	9359.5±0.4	N	N

* Reward metric: each table value is the maximum average return over 10 trials (different seeds) ± one standard deviation over trials. Each trial is averaged over another 10 test seeds. Only data from the first 1M steps is considered. The shaded region on the graph also represents one standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.)

** ~ means the number is approximated from the graph because exact numbers are not provided in the paper. N means no graph is provided.

*** We used the latest version of all MuJoCo environments in gym (0.17.3), which is often not the case in other papers. Please check the original papers for details. (Results for different versions are usually similar, though.)

**** We did not compare against OpenAI Baselines, because for now I think its benchmark is corrupted(?), and I have not been able to find the information I need. However, the Spinning Up docs state that "Spinning Up implementations of DDPG, TD3, and SAC are roughly at-parity with the best-reported results for these algorithms", so I think the lack of comparison with OpenAI Baselines is acceptable.
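
One reading of the reward metric in the first footnote can be sketched as follows. It assumes each trial has already been reduced to one averaged test return per evaluation point, and it uses the population standard deviation; the function name, the argument layout, and both of those choices are assumptions for illustration, not taken from the benchmark scripts.

```python
from statistics import fmean, pstdev


def max_average_return(returns, env_steps, step_limit=1_000_000):
    """Hypothetical helper: best mean return across trials, with its spread.

    returns[i][j] is the test return of trial i at evaluation point j
    (already averaged over that trial's test episodes).
    env_steps[j] is the environment step at which point j was evaluated.
    """
    # Keep only evaluation points inside the step budget.
    keep = [j for j, step in enumerate(env_steps) if step <= step_limit]
    if not keep:
        raise ValueError("no evaluation point within the step limit")

    # Average across trials at every kept evaluation point.
    mean_curve = [fmean([trial[j] for trial in returns]) for j in keep]

    # Pick the evaluation point where the averaged curve peaks.
    best = max(range(len(keep)), key=lambda k: mean_curve[k])
    j = keep[best]

    # Spread across trials at that same point (population std, an assumption).
    spread = pstdev([trial[j] for trial in returns])
    return mean_curve[best], spread
```

With two trials whose returns are `[1, 3, 10]` and `[3, 5, 20]` at 0.5M, 1M and 1.5M steps, the last point is discarded and the reported value would be 4.0 ± 1.0.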

I only show one figure here as an example; all other figures for the Tianshou MuJoCo benchmark can be found here.

Achieving these results was not easy: it required not only hyperparameter tuning but also changes to some features of the Tianshou platform, most of which have already been raised in different issues by different users. For example:

#140, #245 and #255 mentioned that the collector collects whole episodes of data even when n_step = 1 is set.
#249 mentioned that the speed comparison might be unfair because Tianshou originally used the update step as the x-axis.
#209 mentioned that the original MuJoCo results can no longer be reproduced because the code has changed a lot in the past few months.
The discussion in #194 indicates that some policies officially supported by Tianshou can be refactored to be easier to use.
#161 asks for curve-drawing examples or tools, which are also urgently needed when creating a benchmark.

There are also other problems that existing issues have not mentioned, or that I have not seen mentioned. For instance:

In the trainer, log_interval is shared by the update step and the env step, which is inconvenient. A flexible logger might help.
In net utils, the Net function can only create MLPs in which every hidden layer has the same number of units.
In policies, some policies add exploration noise when the algorithm is being evaluated.
Buffer and Collector in Tianshou are currently a little too complex because each tries to support all features in one single class, which makes it very inconvenient to understand the source code or to inherit from those classes to create customized data structures.

All the problems above will be addressed to some extent as part of releasing the benchmark. The scripts that produce this benchmark are hosted on my fork of Tianshou and can be found here. However, they cannot be merged directly: the code only serves to demonstrate our idea, so it is not well organized (it lacks consistency, docs, comments, tests, etc.). Another reason is that this will be a big merge into Tianshou, and we want to enhance Tianshou without causing too much disruption for our users. As a result, I have made a plan and hope to merge all the code in 6 commits in total over the next few weeks. All of these commits are aimed at eventually releasing the benchmark above.

Plans

Here I briefly introduce what these 6 commits try to do.

In net utils, enhance the Net function so that it supports any type of MLP.

This is the most urgent commit because the Net function will be needed in another PR.
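
As a rough illustration of what "any type of MLP" could mean here, the sketch below derives the layer shapes from a list of per-layer hidden sizes instead of a single width repeated for every layer. `mlp_layer_shapes` and its parameters are hypothetical names chosen for this example, not Tianshou's API.

```python
def mlp_layer_shapes(input_dim, hidden_sizes, output_dim=None):
    """Hypothetical helper: (in_features, out_features) for each linear layer.

    hidden_sizes may contain a different width per layer, e.g. [400, 300].
    If output_dim is None, the last hidden layer is the output of the network.
    """
    dims = [input_dim, *hidden_sizes]
    if output_dim is not None:
        dims.append(output_dim)
    # Pair every dimension with the next one: one pair per linear layer.
    return list(zip(dims[:-1], dims[1:]))
```

For example, `mlp_layer_shapes(17, [400, 300], 6)` gives `[(17, 400), (400, 300), (300, 6)]`; each pair would map onto one linear layer (such as `torch.nn.Linear(in_features, out_features)`) followed by an activation.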

Minor fix of Batch and a new ReplayBuffer class called CachedReplayBuffer.

CachedReplayBuffer is used to replace the _cached_buf of Collector in the next commit, which is critical to solving the n_step problem mentioned in #245.
Change the definition of ReplayBuffer to a particular way of managing Batch, because a chronologically organized ReplayBuffer might not be suitable for all scenarios.
Give all buffer types inheriting from ReplayBuffer the same API (the indexing method, for instance), so that the underlying implementation of the different ReplayBuffer types is the developers' concern, not the users'.
[Probably] Separate the stack option from the other abilities of ReplayBuffer to make the source code easier to understand or rewrite, and gain efficiency at the same time.
Docs, tests, etc.

Refactor of collector to support both ReplayBuffer and CachedReplayBuffer.

Fix #245 by supporting CachedReplayBuffer and not allowing ReplayBuffer to be used when n_env > 1.
Remove return info that is not widely used, to make the code more lightweight.
Change BasePolicy to prepare for the upcoming change to the indexing method of CachedReplayBuffer.
Fix a bug in BasePolicy: when done is ignored and n_step > 1 is set in off-policy algorithms, a small number of target Q values are calculated incorrectly.
Change the behavior of action noise: from now on, all exploration noise is added in the collector, which makes it easier to redefine and less likely to cause bugs than adding it in the forward function. Partly solves #194.
Small changes in the trainer to match the collector's changes.
Docs, tests, etc.
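
The change to action noise can be pictured as a step applied at collection time only, so that evaluation uses the policy's raw action. This is a sketch under assumptions: Gaussian noise, a fixed action bound, and the names `add_exploration_noise` and `select_action` are illustrative choices, not Tianshou's implementation.

```python
import random


def add_exploration_noise(action, sigma, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each action dimension, then clip."""
    noisy = [a + rng.gauss(0.0, sigma) for a in action]
    return [min(high, max(low, a)) for a in noisy]


def select_action(policy_action, training, sigma=0.1):
    """Hypothetical collector-side step: noise only while collecting training data."""
    if training:
        return add_exploration_noise(policy_action, sigma)
    # Evaluation: return the deterministic policy output unchanged.
    return list(policy_action)
```

Keeping the noise outside the policy's forward pass means a test-time call cannot pick it up by accident, which is the problem described earlier for evaluation.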

Refactor of trainer to support a user-defined logger.

Add a logger to the trainer that can be user-defined and will be used in benchmarking.
Remove the original log_interval, save_fn, writer, etc. (all logging functions).
Add a default logger that does essentially everything the original logging functions did. Partly solves #161.
Docs, tests, etc.
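
A logger with independent intervals for env steps and update steps, which the shared log_interval could not express, might look roughly like this. The class name `IntervalLogger`, its methods, and the `write` callback are hypothetical and only sketch the idea.

```python
class IntervalLogger:
    """Hypothetical logger with separate intervals for env steps and update steps."""

    def __init__(self, train_interval=1000, update_interval=100, write=print):
        self.train_interval = train_interval
        self.update_interval = update_interval
        self.write = write  # any callable taking one record, e.g. list.append
        # Start one interval "in the past" so the very first call is logged.
        self._last_train = -train_interval
        self._last_update = -update_interval

    def log_train(self, env_step, data):
        """Record training statistics, at most once per train_interval env steps."""
        if env_step - self._last_train >= self.train_interval:
            self.write(("train", env_step, dict(data)))
            self._last_train = env_step

    def log_update(self, update_step, data):
        """Record update statistics, at most once per update_interval updates."""
        if update_step - self._last_update >= self.update_interval:
            self.write(("update", update_step, dict(data)))
            self._last_update = update_step
```

A user who wants a different backend would only swap the `write` callable or subclass the logger, without touching the trainer.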

Some small fixes in tianshou/policy to make policies easier to use and to add some standard tricks.

Take gym's 'TimeLimit.truncated' flag into account, to make the policy more efficient.
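
One common way to take the flag into account, sketched here as an assumption about the intent rather than as the planned implementation, is to keep bootstrapping from the next state when an episode ended only because of the time limit. `TimeLimit.truncated` is the key that gym's `TimeLimit` wrapper writes into `info`; the function names below are hypothetical.

```python
def is_true_terminal(done, info):
    """True only if the episode ended for a reason other than the time limit."""
    return bool(done) and not info.get("TimeLimit.truncated", False)


def one_step_target(reward, next_q, done, info, gamma=0.99):
    """One-step TD target that still bootstraps on time-limit truncation."""
    # Drop the bootstrap term only at a genuine terminal state.
    bootstrap = 0.0 if is_true_terminal(done, info) else next_q
    return reward + gamma * bootstrap
```

Treating a truncated episode as terminal would tell the critic that the value after the last step is zero, even though the task could have continued.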

Release the MuJoCo benchmark (source code, data, graphs, detailed comparison, analysis of hyperparameters, etc.) for the 3 algorithms.

Future work

Remove warnings and implementations for methods that were originally supported but are now unsupported.
Add support (benchmarked in the same way) for other algorithms (VPG, PPO, TRPO, etc.).
Speed analysis, and a set of hyperparameters that can be trained in parallel using Tianshou to speed up training.
Consider discrete-action environments like Atari (maybe support Rainbow in Tianshou).
A tutorial on how to tune hyperparameters for a given RL problem.
......

Trinkle23897 commented 3 years ago

In short, the major thing is to move cache_buffer (currently handled in Collector) into the buffer level, to support exactly n_step collect and make the collector cleaner.

I really love the method proposed by @ChenDRAG. He organizes the CachedReplayBuffer as:

| main_buffer | cache_buffer_1 | ... | cache_buffer_n |
|                 a whole batch                       |

where n == number of envs. All of these data are stored in a single (and large) batch. Therefore, we can greatly simplify the original collector's code.
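
The layout above can be sketched as index arithmetic over one flat storage range. The class name `CachedLayout`, its methods, and the sizes are hypothetical and only illustrate the idea; the actual CachedReplayBuffer may differ.

```python
class CachedLayout:
    """Hypothetical index map for | main | cache_1 | ... | cache_n | in one batch."""

    def __init__(self, main_size, cache_size, n_envs):
        self.main_size = main_size
        self.cache_size = cache_size
        self.n_envs = n_envs
        # Everything lives in a single storage of this total length.
        self.total_size = main_size + cache_size * n_envs

    def main_range(self):
        """Indices of the main buffer, at the front of the storage."""
        return range(0, self.main_size)

    def cache_range(self, env_id):
        """Indices of the cache segment that belongs to environment env_id."""
        if not 0 <= env_id < self.n_envs:
            raise IndexError(f"env_id {env_id} out of range")
        start = self.main_size + env_id * self.cache_size
        return range(start, start + self.cache_size)
```

With `CachedLayout(100, 10, 4)`, the storage has 140 slots and environment 3 writes into indices 130 to 139; finished episodes would then be copied from a cache segment into the main range.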

Also, we plan to separate the async collect method into an AsyncCollector (inheriting from the simplified base collector). Most of the time users run experiments with the sync method, but the current async code in the collector adds a lot of overhead. Splitting these functions will make things cleaner and easier for users to handle.

Trinkle23897 commented 3 years ago

TODO list after #280:

[x] split buffer and collector into several files
[ ] optimization for batch
[x] optimization for Atari training -- it is at almost half the speed compared to 0.3.2
[x] docs of tianshou.policy need a TOC

ChenDRAG commented 3 years ago

The first 5 of the 6 commits discussed above are finished. I have reproduced the MuJoCo benchmark for some algorithms in some environments. Some results are better, some are worse. Based on the results I observed, we can still use the benchmark graphs provided above. Perhaps the dev branch is ready to be merged into master?

ChenDRAG commented 3 years ago

example results:

Trinkle23897 commented 3 years ago

Could you please provide the new numerical result here? (based on what you have experimented)
