vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

TD3 / DDPG action bound fix #211

Closed dosssman closed 2 years ago

dosssman commented 2 years ago

Description

Adds action scaling to the Actor components of DDPG and TD3 to make sure the environment's action boundaries are respected when sampling actions to compute the target for the Q-learning update. Closes #196.
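For context, here is a minimal sketch of this kind of rescaling in a DDPG/TD3-style actor; it illustrates the approach rather than reproducing the exact diff, with action_scale and action_bias derived from the environment's action space:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        obs_dim = int(np.prod(env.single_observation_space.shape))
        act_dim = int(np.prod(env.single_action_space.shape))
        self.fc1 = nn.Linear(obs_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_mu = nn.Linear(256, act_dim)
        # Registered as buffers (not parameters): they follow .to(device) and are
        # saved in the state_dict, but receive no gradients.
        high, low = env.single_action_space.high, env.single_action_space.low
        self.register_buffer("action_scale", torch.tensor((high - low) / 2.0, dtype=torch.float32))
        self.register_buffer("action_bias", torch.tensor((high + low) / 2.0, dtype=torch.float32))

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # tanh squashes to [-1, 1]; rescale into the env's [low, high] box.
        return torch.tanh(self.fc_mu(x)) * self.action_scale + self.action_bias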

Preliminary experiments tracked here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3-Action-Bound-Fix-Check--VmlldzoyMjAwMjM5

EDIT 1: Updated SAC continuous to use self.register_buffer for action_scale and action_bias.
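On the critic side, with the (target) actor now emitting actions on the environment's scale, the target for the Q-learning update only needs its smoothing noise scaled accordingly and a clamp to the true bounds. A rough TD3-style sketch, assuming names such as data (a replay-buffer sample), target_actor, args.policy_noise, and args.noise_clip, and a single shared bound per action dimension:

with torch.no_grad():
    # Target policy smoothing: noise scaled to the action range, then clipped.
    clipped_noise = (
        torch.randn_like(data.actions) * args.policy_noise
    ).clamp(-args.noise_clip, args.noise_clip) * target_actor.action_scale
    # Keep the perturbed target action inside the environment's action bounds.
    next_actions = (target_actor(data.next_observations) + clipped_noise).clamp(
        float(envs.single_action_space.low[0]), float(envs.single_action_space.high[0])
    )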

Types of changes

Checklist:

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

vwxyzjn commented 2 years ago

For some reason, the new ddpg runs don't utilize the GPU...?

vwxyzjn commented 2 years ago

Hey @dosssman I am going to put the old ddpg experiments (Hopper-v2, Walker-v2, HalfCheetah-v2) back to the openrlbenchmark/cleanrl since they should not be affected by this PR (as they have the symmetrical [-1, 1] action space)

dosssman commented 2 years ago

> Hey @dosssman I am going to put the old ddpg experiments (Hopper-v2, Walker-v2, HalfCheetah-v2) back to the openrlbenchmark/cleanrl since they should not be affected by this PR (as they have the symmetrical [-1, 1] action space)

Pretty sure they do. Here is GPU usage from the latest DDPG continuous runs: https://wandb.ai/openrlbenchmark/cleanrl/runs/38a4fiaq/system?workspace=user-dosssman

dosssman commented 2 years ago

> Btw do we need to do anything about this? The max_action is still only working in a symmetric case.
>
> actions = np.array(
>     [
>         (
>             actions.tolist()[0]
>             + np.random.normal(0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0])
>         ).clip(envs.single_action_space.low, envs.single_action_space.high)
>     ]
> )
>
> Please don't worry about re-running the experiments, since all mujoco envs have symmetric low and high values for the action space, and we don't need to re-run the experiments as long as the changes don't impact the benchmark results in any way.

Yes, this was updated to center the distribution from which the exploration noise is sampled to the low-high range of the action space, as in: https://github.com/vwxyzjn/cleanrl/blob/db6ff294e83f1e079f1fb1aa18bd71171b0dfab9/cleanrl/td3_continuous_action.py#L183-L190

EDIT: My bad, it was not done in DDPG yet.
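For reference, a rough sketch of that exploration-noise sampling, with the noise std scaled per dimension by the actor's action_scale instead of a single max_action (variable names follow the cleanrl scripts; this paraphrases rather than copies the linked lines):

with torch.no_grad():
    actions = actor(torch.Tensor(obs).to(device))
    # Std is scaled per action dimension by the actor's action range.
    actions += torch.normal(0.0, actor.action_scale * args.exploration_noise)
    actions = actions.cpu().numpy().clip(envs.single_action_space.low, envs.single_action_space.high)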

vwxyzjn commented 2 years ago

> > Btw do we need to do anything about this? The max_action is still only working in a symmetric case.
> >
> > actions = np.array(
> >     [
> >         (
> >             actions.tolist()[0]
> >             + np.random.normal(0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0])
> >         ).clip(envs.single_action_space.low, envs.single_action_space.high)
> >     ]
> > )
> >
> > Please don't worry about re-running the experiments, since all mujoco envs have symmetric low and high values for the action space, and we don't need to re-run the experiments as long as the changes don't impact the benchmark results in any way.
>
> Yes, this was updated to center the distribution from which the exploration noise is sampled to the low-high range of the action space, as in:
>
> https://github.com/vwxyzjn/cleanrl/blob/db6ff294e83f1e079f1fb1aa18bd71171b0dfab9/cleanrl/td3_continuous_action.py#L183-L190
>
> EDIT: My bad, it was not done in DDPG yet.

Got it. Thank you! Could you update it in DDPG and update the docs? Then everything looks good to me to be merged :)

dosssman commented 2 years ago

On it.

dosssman commented 2 years ago

Docs updated.

dosssman commented 2 years ago

Queued up some runs with the latest changes as a quick check.

vwxyzjn commented 2 years ago

Cool, thank you! If there is no regression in performance, let's merge the PR.

dosssman commented 2 years ago

The experiment results do not differ much from the previous version. This PR should be ready for merge. Thanks in advance.