vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

TD3 / DDPG action bound fix #211

Closed dosssman closed 2 years ago

dosssman commented 2 years ago

Description

Adds action scaling to the Actor components of DDPG and TD3 to make sure the environment's action boundaries are respected when sampling actions to compute the target for the Q-learning update. Closes #196.
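For context, here is a minimal sketch of this kind of rescaling in a DDPG/TD3-style actor; it illustrates the approach rather than reproducing the exact diff, with action_scale and action_bias derived from the environment's action space:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        obs_dim = int(np.prod(env.single_observation_space.shape))
        act_dim = int(np.prod(env.single_action_space.shape))
        self.fc1 = nn.Linear(obs_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_mu = nn.Linear(256, act_dim)
        # Registered as buffers (not parameters): they follow .to(device) and are
        # saved in the state_dict, but receive no gradients.
        high, low = env.single_action_space.high, env.single_action_space.low
        self.register_buffer("action_scale", torch.tensor((high - low) / 2.0, dtype=torch.float32))
        self.register_buffer("action_bias", torch.tensor((high + low) / 2.0, dtype=torch.float32))

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # tanh squashes to [-1, 1]; rescale into the env's [low, high] box.
        return torch.tanh(self.fc_mu(x)) * self.action_scale + self.action_bias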

Preliminary experiments tracked here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3-Action-Bound-Fix-Check--VmlldzoyMjAwMjM5

EDIT 1: Updated SAC continuous to use self.register_buffer for action_scale and action_bias.
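On the critic side, with the (target) actor now emitting actions on the environment's scale, the target for the Q-learning update only needs its smoothing noise scaled accordingly and a clamp to the true bounds. A rough TD3-style sketch, assuming names such as data (a replay-buffer sample), target_actor, args.policy_noise, and args.noise_clip, and a single shared bound per action dimension:

with torch.no_grad():
    # Target policy smoothing: noise scaled to the action range, then clipped.
    clipped_noise = (
        torch.randn_like(data.actions) * args.policy_noise
    ).clamp(-args.noise_clip, args.noise_clip) * target_actor.action_scale
    # Keep the perturbed target action inside the environment's action bounds.
    next_actions = (target_actor(data.next_observations) + clipped_noise).clamp(
        float(envs.single_action_space.low[0]), float(envs.single_action_space.high[0])
    )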

Types of changes

Checklist:

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

vwxyzjn commented 2 years ago

For some reason, the new ddpg runs don't utilize the GPU...?

vwxyzjn commented 2 years ago

Hey @dosssman I am going to put the old ddpg experiments (Hopper-v2, Walker-v2, HalfCheetah-v2) back to the openrlbenchmark/cleanrl since they should not be affected by this PR (as they have the symmetrical [-1, 1] action space)

dosssman commented 2 years ago

> Hey @dosssman I am going to put the old ddpg experiments (Hopper-v2, Walker-v2, HalfCheetah-v2) back to the openrlbenchmark/cleanrl since they should not be affected by this PR (as they have the symmetrical [-1, 1] action space)

Pretty sure they do. Here is GPU usage from the latest DDPG continuous runs: https://wandb.ai/openrlbenchmark/cleanrl/runs/38a4fiaq/system?workspace=user-dosssman

dosssman commented 2 years ago

> Btw do we need to do anything about this? The max_action is still only working in a symmetric case.
>
> actions = np.array(
>     [
>         (
>             actions.tolist()[0]
>             + np.random.normal(0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0])
>         ).clip(envs.single_action_space.low, envs.single_action_space.high)
>     ]
> )
>
> Please don't worry about re-running the experiments, since all mujoco envs have symmetric low and high values for the action space, and we don't need to re-run the experiments as long as the changes don't impact the benchmark results in any way.

Yes, this was updated to center the distribution from which the exploration noise is sampled to the low-high range of the action space, as in: https://github.com/vwxyzjn/cleanrl/blob/db6ff294e83f1e079f1fb1aa18bd71171b0dfab9/cleanrl/td3_continuous_action.py#L183-L190

EDIT: My bad, it was not done in DDPG yet.
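For reference, a rough sketch of that exploration-noise sampling, with the noise std scaled per dimension by the actor's action_scale instead of a single max_action (variable names follow the cleanrl scripts; this paraphrases rather than copies the linked lines):

with torch.no_grad():
    actions = actor(torch.Tensor(obs).to(device))
    # Std is scaled per action dimension by the actor's action range.
    actions += torch.normal(0.0, actor.action_scale * args.exploration_noise)
    actions = actions.cpu().numpy().clip(envs.single_action_space.low, envs.single_action_space.high)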

vwxyzjn commented 2 years ago

> > Btw do we need to do anything about this? The max_action is still only working in a symmetric case.
> >
> > actions = np.array(
> >     [
> >         (
> >             actions.tolist()[0]
> >             + np.random.normal(0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0])
> >         ).clip(envs.single_action_space.low, envs.single_action_space.high)
> >     ]
> > )
> >
> > Please don't worry about re-running the experiments, since all mujoco envs have symmetric low and high values for the action space, and we don't need to re-run the experiments as long as the changes don't impact the benchmark results in any way.
>
> Yes, this was updated to center the distribution from which the exploration noise is sampled to the low-high range of the action space, as in:
>
> https://github.com/vwxyzjn/cleanrl/blob/db6ff294e83f1e079f1fb1aa18bd71171b0dfab9/cleanrl/td3_continuous_action.py#L183-L190
>
> EDIT: My bad, it was not done in DDPG yet.

Got it. Thank you! Could you update it in DDPG and update the docs? Then everything looks good to me to be merged :)

dosssman commented 2 years ago

On it.

dosssman commented 2 years ago

Docs updated.

dosssman commented 2 years ago

Queued up some runs with the latest changes as a quick check.

vwxyzjn commented 2 years ago

Cool, thank you! If there is no regression in performance, let's merge the PR.

dosssman commented 2 years ago

The experiment results do not differ much from the previous version. This PR should be ready for merge. Thanks in advance.