RLops Guide - Githubissues

Our current contribution guide mainly covers the process of contributing new algorithms. However, it is unclear what the process looks like for contributing to existing algorithms, which requires a different set of procedures.

Problem

DRL is brittle and has a series of reproducibility issues; even bug fixes can sometimes introduce performance regressions (e.g., see how a bug fix of contact force in MuJoCo results in worse performance for PPO). Therefore, it is essential to understand how proposed changes impact the performance of the algorithms. Broadly, we wish to distinguish two types of contributions: 1) non-performance-impacting changes and 2) performance-impacting changes.

non-performance-impacting changes: this type of change does not impact the performance of the algorithm. Examples include documentation fixes (#282), renaming variables (#257), and removing unused code (#287). We can merge these changes easily without worrying too much about the consequences.
performance-impacting changes: this type of change impacts the algorithm's performance. Examples include making a slight modification to the gamma parameter in PPO (https://github.com/vwxyzjn/cleanrl/pull/209), properly handling action bounds in DDPG (https://github.com/vwxyzjn/cleanrl/pull/211), and fixing bugs (https://github.com/vwxyzjn/cleanrl/pull/281).

Importantly, however slight a performance-impacting change appears to be, we need to re-run the benchmark to ensure there is no regression. This post proposes a way for us to re-run the benchmark and check for regressions seamlessly.

Proposal

We should add a tag to every benchmark run to record the version of CleanRL used to run the experiments. This can be done by setting WANDB_TAGS when launching the benchmark:

WANDB_TAGS=$(git describe --tags) OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 InvertedPendulum-v2 Humanoid-v2 Pusher-v2 \
    --command "poetry run python cleanrl/td3_continuous_action.py --track --capture-video" \
    --num-seeds 3 \
    --workers 1
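The same tagging could also be done from Python when runs are launched programmatically. This is a minimal sketch, assuming W&B reads a comma-separated list of tags from the WANDB_TAGS environment variable; the helper names current_version_tag and benchmark_env are hypothetical and not part of CleanRL.

```python
import os
import subprocess


def current_version_tag() -> str:
    """Return the output of `git describe --tags` for the current checkout."""
    result = subprocess.run(
        ["git", "describe", "--tags"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def benchmark_env(version_tag: str, base_env=None) -> dict:
    """Copy an environment and append the version tag to WANDB_TAGS.

    Hypothetical helper: existing tags are kept, and the version tag
    is added only if it is not already present.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = [tag for tag in env.get("WANDB_TAGS", "").split(",") if tag]
    if version_tag not in existing:
        existing.append(version_tag)
    env["WANDB_TAGS"] = ",".join(existing)
    # Mirrors the OMP_NUM_THREADS=1 setting in the shell command above.
    env["OMP_NUM_THREADS"] = "1"
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the benchmark command, for example benchmark_env(current_version_tag()).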

This gives us a tag in the tracked experiments, as shown below:

Then we can design APIs to compare results from different tags / versions of the algorithm. Something like

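# Proposed API sketch; cleanrl_utils.compare does not exist yet.
from cleanrl_utils.compare import compare

compare(
    ["HalfCheetah-v2"],
    # Each filter selects the runs of one algorithm at one tagged version.
    filters1={"exp_name": "td3_continuous_action", "tag": "v1.0.0b2-7-g4bb6766"},
    filters2={"exp_name": "td3_continuous_action", "tag": "v1.0.0b2-7-gxfd3d3"},
)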
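To make the idea concrete, here is one way the comparison step behind such an API might work once the per-seed returns for each tag have been fetched. This is only a sketch: the function names, the use of final episodic returns, and the 5% tolerance are all assumptions for illustration, and fetching runs from W&B is left out.

```python
from statistics import mean, stdev


def summarize(returns):
    """Mean and sample standard deviation of per-seed returns."""
    spread = stdev(returns) if len(returns) > 1 else 0.0
    return mean(returns), spread


def compare_returns(baseline, candidate, tolerance=0.05):
    """Compare two sets of per-seed returns for one environment.

    Flags a regression when the candidate mean falls more than
    `tolerance` (a fraction of the baseline mean's magnitude) below
    the baseline mean. The threshold rule is an assumption, not an
    agreed CleanRL policy.
    """
    base_mean, base_std = summarize(baseline)
    cand_mean, cand_std = summarize(candidate)
    threshold = base_mean - tolerance * abs(base_mean)
    return {
        "baseline": (base_mean, base_std),
        "candidate": (cand_mean, cand_std),
        "regression": cand_mean < threshold,
    }
```

With three seeds per version, as in the benchmark command, compare_returns would receive two lists of three numbers each. A real implementation would probably compare whole learning curves rather than a single number per seed.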

which could generate wandb reports with the following figure and corresponding tables.

If the newer version v1.0.0b2-7-g4bb6766 works without causing major regressions, we can then label it as latest (and remove the latest tag from v1.0.0b2-7-gxfd3d3 correspondingly).
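The relabeling step could be sketched as follows, assuming the tags of each version are held in a plain dictionary. The function promote_to_latest is hypothetical; applying the result to the tracked runs is not shown.

```python
def promote_to_latest(tags_by_version, new_version, label="latest"):
    """Return a copy of the mapping in which only `new_version` carries `label`."""
    if new_version not in tags_by_version:
        raise KeyError(f"unknown version: {new_version}")
    updated = {}
    for version, tags in tags_by_version.items():
        # Drop the label everywhere, then add it back to the new version only.
        kept = [tag for tag in tags if tag != label]
        if version == new_version:
            kept.append(label)
        updated[version] = kept
    return updated
```

The input mapping is left unchanged, so the old and new labelings can be inspected side by side before anything is written back.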

In the future, this will also allow us to compare two completely different versions, such as v1.0.0b2-7-g4bb6766 vs v1.5.0.
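Version strings like these come from git describe --tags, which prints the most recent tag, the number of commits since that tag, and an abbreviated commit hash prefixed with g, or just the tag name when the checkout sits exactly on a tag. A small parser, sketched below with the hypothetical name parse_describe, could make the two forms comparable.

```python
import re

# <tag>-<commits since tag>-g<abbreviated hex commit hash>
DESCRIBE_PATTERN = re.compile(
    r"^(?P<tag>.+)-(?P<distance>\d+)-g(?P<commit>[0-9a-f]+)$"
)


def parse_describe(version):
    """Split `git describe --tags` output into (tag, distance, commit)."""
    match = DESCRIBE_PATTERN.match(version)
    if match is None:
        # No suffix: treat the whole string as a tag name.
        return version, 0, None
    return match["tag"], int(match["distance"]), match["commit"]


print(parse_describe("v1.0.0b2-7-g4bb6766"))  # ('v1.0.0b2', 7, '4bb6766')
print(parse_describe("v1.5.0"))               # ('v1.5.0', 0, None)
```

Strings that do not match the suffix pattern are returned unchanged as the tag, so a release tag and a development build can go through the same code path.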

CC @dosssman @yooceii @dipamc @kinalmehta @joaogui1 @araffin @bragajj @cool-RR @jkterry1 for thoughts

vwxyzjn / cleanrl

RLops Guide #296
