vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

RLops Guide #296

Closed vwxyzjn closed 1 year ago

vwxyzjn commented 2 years ago

Our current contribution guide mainly covers the process of contributing new algorithms. However, it is unclear what the process looks like for contributing to existing algorithms, which requires a different set of procedures.

Problem

DRL is brittle and has a series of reproducibility issues: even bug fixes can sometimes introduce performance regressions (e.g., see how a bug fix of contact force in MuJoCo results in worse performance for PPO). Therefore, it is essential to understand how the proposed changes impact the performance of the algorithms. Broadly, we wish to distinguish two types of contributions: 1) non-performance-impacting changes and 2) performance-impacting changes.

Importantly, no matter how slight a performance-impacting change appears, we need to re-run the benchmark to ensure there is no regression. This post proposes a way for us to re-run the experiments and check for regressions seamlessly.

Proposal

We should tag every benchmark run with the version of CleanRL used to run the experiments. This can be done by setting WANDB_TAGS to the output of git describe --tags when launching the benchmark:

WANDB_TAGS=$(git describe --tags) OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 InvertedPendulum-v2 Humanoid-v2 Pusher-v2 \
    --command "poetry run python cleanrl/td3_continuous_action.py --track --capture-video" \
    --num-seeds 3 \
    --workers 1

This gives us a tag in the tracked experiments, as shown below:

[Screenshot: the version tag shown on the tracked experiments]
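For launchers that are not started from a shell, the same tag could be derived in Python. The following is a minimal sketch, assuming git is on PATH; `version_tag` and `with_version_tag` are hypothetical helpers, not part of cleanrl_utils. It relies only on the fact that wandb reads WANDB_TAGS as a comma-separated list.

```python
import subprocess


def version_tag(repo_dir="."):
    """Return the output of `git describe --tags`, or None if it is unavailable."""
    try:
        result = subprocess.run(
            ["git", "describe", "--tags"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, no tags yet, git missing, or bad directory.
        return None
    return result.stdout.strip() or None


def with_version_tag(env, tag):
    """Return a copy of `env` with `tag` appended to the comma-separated WANDB_TAGS."""
    new_env = dict(env)
    if tag:
        existing = [t for t in new_env.get("WANDB_TAGS", "").split(",") if t]
        if tag not in existing:
            existing.append(tag)
        new_env["WANDB_TAGS"] = ",".join(existing)
    return new_env
```

A launcher could then pass `with_version_tag(os.environ, version_tag())` as the `env` argument when spawning each benchmark process, which leaves any tags the user already set in place.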

Then we can design APIs to compare results from different tags / versions of the algorithm. Something like

from cleanrl_utils.compare import compare  # proposed API sketch

compare(
    ["HalfCheetah-v2"],
    # newer version under test
    filters1={"exp_name": "td3_continuous_action", "tag": "v1.0.0b2-7-g4bb6766"},
    # version currently labeled latest
    filters2={"exp_name": "td3_continuous_action", "tag": "v1.0.0b2-7-gxfd3d3"},
)

which could generate wandb reports with the following figure and corresponding tables.

[Figure: comparison of results between the two tagged versions]
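To make the idea concrete, here is a rough sketch of the table-building half of such a comparison. It is an assumption-laden illustration, not the proposed implementation: it works on run records that have already been fetched (plain dicts with hypothetical keys `env_id`, `exp_name`, `tags`, and `episodic_return`) and takes them as an extra `runs` argument, so it makes no wandb calls and produces no report.

```python
from statistics import mean, stdev


def summarize(runs, env_id, filters):
    """Return (mean, std, count) of final returns for runs matching the filters."""
    returns = [
        run["episodic_return"]
        for run in runs
        if run["env_id"] == env_id
        and run["exp_name"] == filters["exp_name"]
        and filters["tag"] in run["tags"]
    ]
    if not returns:
        raise ValueError(f"no runs for {env_id} matching {filters}")
    # stdev needs at least two samples; report 0.0 for a single seed.
    spread = stdev(returns) if len(returns) > 1 else 0.0
    return mean(returns), spread, len(returns)


def compare(env_ids, runs, filters1, filters2):
    """Build one table row per environment comparing two tagged versions."""
    rows = []
    for env_id in env_ids:
        mean1, std1, n1 = summarize(runs, env_id, filters1)
        mean2, std2, n2 = summarize(runs, env_id, filters2)
        rows.append(
            {
                "env_id": env_id,
                "mean1": mean1, "std1": std1, "n1": n1,
                "mean2": mean2, "std2": std2, "n2": n2,
                # Positive delta means filters1 scored higher than filters2.
                "delta": mean1 - mean2,
            }
        )
    return rows
```

Keeping the aggregation separate from data fetching means the same rows could feed either a printed table or a generated report.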

If the newer tagged version v1.0.0b2-7-g4bb6766 works without causing a major regression, we can then label it as latest (and correspondingly remove the latest label from v1.0.0b2-7-gxfd3d3).

In the future, this will allow us to compare two completely different versions, too, like v1.0.0b2-7-g4bb6766 vs v1.5.0.
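The relabeling decision could be sketched as follows. The post does not define what counts as a "major" regression, so the relative tolerance `rel_tol` here is purely an assumption, and both function names are hypothetical.

```python
def no_major_regression(baseline, candidate, rel_tol=0.1):
    """Check that no environment's mean return drops by more than rel_tol.

    `baseline` and `candidate` map env_id -> mean episodic return.
    abs() keeps the tolerance meaningful for environments with negative returns.
    """
    for env_id, base in baseline.items():
        allowed_drop = rel_tol * abs(base)
        if candidate[env_id] < base - allowed_drop:
            return False
    return True


def move_latest(labels, old_tag, new_tag):
    """Return a copy of `labels` (tag -> set of labels) with "latest" moved to new_tag."""
    updated = {tag: set(names) for tag, names in labels.items()}
    updated.setdefault(old_tag, set()).discard("latest")
    updated.setdefault(new_tag, set()).add("latest")
    return updated
```

In practice a fixed threshold on mean returns is a crude criterion given seed variance; it is shown only to illustrate the gate-then-relabel flow.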

CC @dosssman @yooceii @dipamc @kinalmehta @joaogui1 @araffin @bragajj @cool-RR @jkterry1 for thoughts

vwxyzjn commented 1 year ago

Closed by #368