For some reason, the new ddpg runs don't utilize the GPU...?
Hey @dosssman I am going to put the old ddpg experiments (Hopper-v2, Walker-v2, HalfCheetah-v2) back to openrlbenchmark/cleanrl, since they should not be affected by this PR (as they have the symmetrical [-1, 1] action space).
Pretty sure they do. Here is GPU usage from the latest DDPG continuous runs: https://wandb.ai/openrlbenchmark/cleanrl/runs/38a4fiaq/system?workspace=user-dosssman
Btw do we need to do anything about this? The `max_action` is still only working in a symmetric case:

```python
actions = np.array(
    [
        (
            actions.tolist()[0]
            + np.random.normal(
                0, max_action * args.exploration_noise, size=envs.single_action_space.shape[0]
            )
        ).clip(envs.single_action_space.low, envs.single_action_space.high)
    ]
)
```
Please don't worry about re-running the experiments: all MuJoCo envs have symmetric low and high values for the action space, so as long as the changes don't impact the benchmark results in any way, there is no need to re-run.
Yes, this was updated so that the distribution the exploration noise is sampled from is scaled to the low-high range of the action space, as in: https://github.com/vwxyzjn/cleanrl/blob/db6ff294e83f1e079f1fb1aa18bd71171b0dfab9/cleanrl/td3_continuous_action.py#L183-L190
EDIT: My bad, it was not done in DDPG yet.
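For reference, the pattern in the linked TD3 code looks roughly like this (a sketch assuming the surrounding training-loop variables `obs`, `envs`, `args`, `device`, and the `action_scale` buffer added in this PR; not the exact diff):

```python
# Scale the exploration noise by the actor's action_scale buffer so the
# perturbation is proportional to the width of the action space, then clip
# to the environment's true bounds.
with torch.no_grad():
    actions = actor(torch.Tensor(obs).to(device))  # actor output already rescaled to [low, high]
    actions += torch.normal(0, actor.action_scale * args.exploration_noise)
    actions = (
        actions.cpu()
        .numpy()
        .clip(envs.single_action_space.low, envs.single_action_space.high)
    )
```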
Got it. Thank you! Could you update it in DDPG and update the docs? Then everything looks good to me to be merged :)
On it.
Docs updated.
Queued up some runs with the latest changes as a quick check.
Cool, thank you! If there is no regression in performance, let's merge the PR.
Experiment results do not differ much from the previous version. This PR should be ready to merge. Thanks in advance.
Description
Adds action scaling to the Actor components of DDPG and TD3 to make sure the action boundaries of the environment are respected when sampling actions to compute the target for the Q-learning update. Closes #196.
Preliminary experiments tracked here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3-Action-Bound-Fix-Check--VmlldzoyMjAwMjM5
EDIT 1: Updated SAC continuous to use `self.register_buffer` for `action_scale` and `action_bias`.
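For completeness, the scaling pattern looks roughly like this (a minimal sketch; the 256-unit layers and exact names are illustrative, not the exact diff):

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_mu = nn.Linear(256, np.prod(env.single_action_space.shape))
        # Register the rescaling constants as buffers so they follow the module
        # across .to(device) calls and are saved in the state_dict, without
        # being treated as trainable parameters.
        self.register_buffer(
            "action_scale",
            torch.tensor(
                (env.single_action_space.high - env.single_action_space.low) / 2.0,
                dtype=torch.float32,
            ),
        )
        self.register_buffer(
            "action_bias",
            torch.tensor(
                (env.single_action_space.high + env.single_action_space.low) / 2.0,
                dtype=torch.float32,
            ),
        )

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.tanh(self.fc_mu(x))
        # tanh keeps the raw output in [-1, 1]; rescale it into [low, high]
        return x * self.action_scale + self.action_bias
```

Using `register_buffer` rather than plain attributes is what makes the scale and bias device-safe and checkpoint-safe, which is the point of EDIT 1 above.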
Types of changes

Checklist:

- `pre-commit run --all-files` passes (required).
- Documentation updated and previewed via `mkdocs serve`.
- ~~If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.~~
- Tracked experiments with the `--capture-video` flag toggled on (required).
- Documentation changes previewed via `mkdocs serve`.