Closed: huxiao09 closed this issue 2 years ago
Thanks for raising this issue. I think you are referring to https://github.com/sfujim/TD3/blob/385b33ac7de4767bab17eb02ade4a268d3e4e24f/TD3.py#L28? Yeah, this is a bit of a problem. That said, it's fine in MuJoCo because `max_action` is just 1.
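For context, the line in question is TD3's target policy smoothing. A minimal numpy sketch of that logic (names and values here are illustrative; Fujimoto's TD3 defaults are `policy_noise=0.2` and `noise_clip=0.5`):

```python
import numpy as np

rng = np.random.default_rng(0)
max_action = 1.0          # the magnitude of the action-space bound
policy_noise, noise_clip = 0.2, 0.5

def smoothed_target_action(target_actor_out):
    """Target policy smoothing: add clipped Gaussian noise, then clamp to
    the valid action range. Clamping to [-max_action, max_action] only
    matches a hard-coded [-1, 1] clamp when max_action == 1, which is why
    environments with other bounds are affected."""
    noise = np.clip(rng.normal(0.0, policy_noise, target_actor_out.shape),
                    -noise_clip, noise_clip)
    return np.clip(target_actor_out + noise, -max_action, max_action)

a = smoothed_target_action(np.array([0.9, -0.9]))
```
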
Yep. I still highly suggest you adopt Fujimoto's implementation, because in some MuJoCo environments (like Humanoid-v2) and some other environments `max_action` may not be 1.
I also highly suggest adding PyTorch model saving to each of your single files, and then writing a separate single file that loads the saved model and evaluates it in the corresponding environment.
Thanks for your reply and your nice CleanRL code :)
Thanks for the suggestion! Fixing the `max_action` issue with the actor makes sense to me.
Model saving, on the other hand, is a bit more nuanced. For example, ppo_continuous_action.py technically also needs to save the running mean and standard deviation stored in the environment wrappers. Because of these complications, we have decided against doing model saving, to keep the core implementation as simple as possible. Additionally, most people save models only to load them later and see what the agent is actually doing; for that use case, we already include videos of the agents playing the game in the docs (e.g., link), which further eliminates the need to save models.
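To illustrate the nuance: a checkpoint for a PPO agent with observation normalization would need to bundle the normalization statistics alongside the network weights. The sketch below is hypothetical (the names `model_state`, `obs_mean`, `obs_var` are mine, not cleanrl's API) and uses plain pickling for brevity:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, model_state, obs_mean, obs_var):
    """Persist both the network weights and the running observation
    statistics; restoring only the weights would feed the policy
    un-normalized observations at evaluation time."""
    with open(path, "wb") as f:
        pickle.dump({"model_state": model_state,
                     "obs_mean": obs_mean,
                     "obs_var": obs_var}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip with dummy data:
path = os.path.join(tempfile.gettempdir(), "ppo_ckpt_demo.pkl")
save_checkpoint(path, {"w": [1.0, 2.0]}, obs_mean=[0.1], obs_var=[1.0])
ckpt = load_checkpoint(path)
```
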
@dosssman This is indeed a much larger problem. I just tested it out, and there are a number of environments that do not have 1 as the default `max_action`:
Testing environment: Ant-v2 Observation space: Box(-inf, inf, (111,), float64) Action space: Box(-1.0, 1.0, (8,), float32)
Testing environment: HalfCheetah-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
Testing environment: Hopper-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (3,), float32)
Testing environment: Humanoid-v2 Observation space: Box(-inf, inf, (376,), float64) Action space: Box(-0.4, 0.4, (17,), float32)
Testing environment: InvertedDoublePendulum-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (1,), float32)
Testing environment: InvertedPendulum-v2 Observation space: Box(-inf, inf, (4,), float64) Action space: Box(-3.0, 3.0, (1,), float32)
Testing environment: Pusher-v2 Observation space: Box(-inf, inf, (23,), float64) Action space: Box(-2.0, 2.0, (7,), float32)
Testing environment: Reacher-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Swimmer-v2 Observation space: Box(-inf, inf, (8,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Walker2d-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
@huxiao09 would you be interested in submitting a fix for this and going through the process of re-running benchmarks? Otherwise, no worries - we will take care of this.
Sure, it's my pleasure. But I am busy this week; I'm afraid I can only finish it within two weeks. Is that too late?
Thanks for the heads up. Will look into this.
Two weeks is actually a great timeline. Thank you @huxiao09 for your interest in working on this!
The process would go like this:

1. fix the `max_action` issue
2. re-run the benchmark experiments and add them to `openrlbenchmark`, the wandb entity that tracks all of our benchmark experiments

Btw, would this impact sac_continuous_action.py @dosssman? I see in the SAC file there is no action clip like done in PPO.
SAC should be unaffected, as the actions are scaled to the min/max actions before being returned to the algorithm.
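The scaling in question can be sketched as follows (a numpy illustration, not the actual SAC file; the idea is that a per-dimension scale and bias derived from the Box bounds map the tanh-squashed sample onto the env's range, so no extra clipping is needed):

```python
import numpy as np

def scale_action(squashed, low, high):
    """Map a tanh output in [-1, 1] onto [low, high] element-wise.
    scale = (high - low) / 2 and bias = (high + low) / 2, so a squashed
    value of +1 lands exactly on `high` and -1 on `low`."""
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    action_scale = (high - low) / 2.0
    action_bias = (high + low) / 2.0
    return squashed * action_scale + action_bias

# Humanoid-v2's action space is Box(-0.4, 0.4, ...) per the listing above,
# so a tanh output of +1 maps to 0.4 and -1 maps to -0.4:
out = scale_action(np.array([1.0, -1.0, 0.0]), -0.4, 0.4)
```
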
ok, got it :)
Thank you so much. Would you mind registering a wandb account and giving me your username? Would you also mind adding me on discord (Costa#2021)?

Thanks
Yes, I already have an account: https://wandb.ai/huxiao. But it seems that Discord can't be accessed in my region. I'll figure out how to solve this problem tomorrow.
Preliminary fix tracked here for TD3 and here for DDPG.
The corresponding PR is #211
Overall, it does not seem to make that big of a difference compared to the previous version, but this version is indeed theoretically correct.
I changed the run set name as follows (customize the advanced legend to be `${runsetName}`).

Could you do the following:

- [ ] remove the new `td3_continuous_action_bound_fix.py` experiments with `HalfCheetah-v2` (since its action space's bound is -1, +1)
- [ ] move the old `td3_continuous_action.py` experiments with `Humanoid-v2`, `Pusher-v2`, and `InvertedPendulum-v2` to the `openrlbenchmark/cleanrl-cache` project
- [ ] add the new `td3_continuous_action_bound_fix.py` experiments with `Humanoid-v2` and `Pusher-v2` to https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3--VmlldzoxNjk4Mzk5 (in particular, customize the runset name to be `CleanRL's td3_continuous_action.py` and set the advanced legend to be `${runsetName}`)
- [ ] update the benchmark file https://github.com/vwxyzjn/cleanrl/blob/master/benchmark/td3.sh to include `Humanoid-v2`, `Pusher-v2`, and `InvertedPendulum-v2`
- [ ] add a section in the documentation updating the charts for `Humanoid-v2`, `Pusher-v2`, and `InvertedPendulum-v2` and explaining the use of action scale and bias

Thank you so much!
OK. @dosssman has actually already done exactly what I was going to do. Sorry that my reply may not have reached you in time.
Thank you very much for the feedback. Those two lines are quite a good catch. Will fix and add re-runs just to be sure.
Thank you, Rousslan. Please remove the experiments or move them to the cleanrl-cache project; the stale experiments should not remain in openrlbenchmark/cleanrl.
No worries, will do. I'll tweak the last two lines @huxiao09 mentioned and re-run the experiments to make sure there are no side effects, then clean up the benchmark project structure and update the docs.
Should get it done by next Tuesday or Wednesday. Thanks.
Let's just keep those old experiments to compare with the latest variant until then. Once the latest one is validated, I can clean up all the previous variants and re-run td3_continuous_action.py to prevent future confusion.
The re-runs are looking good so far; not much difference from the previous iterations. ~It seems I will have to trouble you regarding the archival/removal of the old TD3 and DDPG runs. Wandb mentions missing permissions when I tried to either move the runs to the cleanrl-cache project or just delete them, before queuing up clean td3_continuous_action.py and ddpg_continuous_action.py runs.~
@vwxyzjn I do need some help removing these runs, as I am not their author.
Thanks.
Problem Description
Hi! It seems that the output of the target_actor in DDPG/TD3 is directly clipped to fit the action range boundaries, without multiplying by max_action. But in Fujimoto's DDPG/TD3 code [1] and some other implementations, max_action is multiplied into the last tanh layer of the actor network, so they don't use clip. Have you ever tried the second implementation?
[1] https://github.com/sfujim/TD3
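The difference between the two approaches can be illustrated with a small numpy sketch (names and values are mine, not from either codebase; `max_action = 0.4` mimics Humanoid-v2's bound):

```python
import numpy as np

max_action = 0.4
pre_tanh = np.array([-3.0, 0.0, 3.0])  # hypothetical pre-activation outputs

# Variant 1: clip the raw tanh output to the action bounds.
clipped = np.clip(np.tanh(pre_tanh), -max_action, max_action)

# Variant 2 (Fujimoto-style): scale the tanh output by max_action.
scaled = max_action * np.tanh(pre_tanh)

# Both stay within [-0.4, 0.4], but when max_action < 1 the clip variant
# flattens every tanh value beyond the bound onto the bound itself, while
# the scaled variant preserves the shape of the tanh output.
```
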