added gamma to reward normalization wrappers

Howuhh commented 2 years ago

Description

Fixes incorrect gamma in reward normalization wrapper for non-default gamma's. See https://github.com/vwxyzjn/cleanrl/issues/203.

Types of changes

[x] Bug fix
[ ] New feature
[ ] New algorithm
[ ] Documentation

Checklist:

[x] I've read the CONTRIBUTION guide (required).
[x] I have ensured pre-commit run --all-files passes (required).

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

[x] I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
[x] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
[x] I have added additional documentation and previewed the changes via mkdocs serve.
- [x] I have explained note-worthy implementation details.
- [x] I have added the learning curves (in PNG format with width=500 and height=300).
- [x] I have added links to the tracked experiments.

vercel[bot] commented 2 years ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated
cleanrl	✅ Ready (Inspect)	Visit Preview	Jul 6, 2022 at 7:59PM (UTC)

vwxyzjn commented 2 years ago

So here is the tricky part - the original implementation actually uses 0.999 for gamma, but 0.99 for the normalization wrapper. See https://github.com/openai/train-procgen/blob/1a2ae2194a61f76a733a39339530401c024c3ad8/train_procgen/train.py#L43

This would cause a performance change unfortunately. There are two ways to go forward

re-run the procgen benchmark experiments with gym.wrappers.NormalizeReward(envs, gamma=args.gamma). https://github.com/vwxyzjn/cleanrl/blob/6387191dbee74927b2872b2eb1759c72361d806f/benchmark/ppo.sh#L39-L44 https://github.com/vwxyzjn/cleanrl/blob/6387191dbee74927b2872b2eb1759c72361d806f/benchmark/ppg.sh#L3-L8
keep the procgen scripts untouched.

@Howuhh what do you think we should do?

Howuhh commented 2 years ago

@vwxyzjn To be honest, I think this is a bug in original code, not a feature, so it will be more accurate to rerun for correct results. However, procgen is image based env and for now I don't have resources to train on images.

vwxyzjn commented 2 years ago

Ok, no worries. I will take care from here. @Dipamc77 I don't have the GPU memory to run the PPG experiments. Would you mind running them with this PR? I can take care of the ppo procgen experiments.

https://github.com/vwxyzjn/cleanrl/blob/6387191dbee74927b2872b2eb1759c72361d806f/benchmark/ppg.sh#L3-L8

vwxyzjn commented 2 years ago

Running the PPO experiments now. Also tried a fun thing by adding a wandb tag like

WANDB_TAGS=$(git describe --tags)  xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids starpilot bossfight bigfish \
    --command "poetry run python cleanrl/ppo_procgen.py --track --capture-video" \
    --num-seeds 3 \
    --workers 1

which produces runs like

@dosssman I think this tagging system could somehow help us phase out past openrlbenchmark experiments without deleting them. I will have to think about the workflow a bit more.

vwxyzjn commented 2 years ago

Following up here