vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

DDPG/TD3 target_actor output clip #196

Closed · huxiao09 closed this issue 2 years ago

huxiao09 commented 2 years ago

Problem Description

Hi! It seems that the output of target_actor in DDPG/TD3 is clipped directly to the action range boundaries, without multiplying by max_action. In Fujimoto's DDPG/TD3 code [1] and some other implementations, the max_action is applied to the output of the actor network's final tanh layer, so no clipping is needed. Have you ever tried the second implementation?

if global_step > args.learning_starts:
                data = rb.sample(args.batch_size)
                with torch.no_grad():
                    next_state_actions = (target_actor(data.next_observations)).clamp(
                        envs.single_action_space.low[0], envs.single_action_space.high[0]
                    )
                    qf1_next_target = qf1_target(data.next_observations, next_state_actions)
                    next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * args.gamma * (qf1_next_target).view(-1)

[1] https://github.com/sfujim/TD3
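
For reference, a minimal sketch (not CleanRL's actual code) of a Fujimoto-style actor, where the final tanh output is multiplied by max_action so the target action never needs to be clamped:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Sketch of a Fujimoto-style actor: the tanh output is scaled by
    max_action, so the output already lies within the action bounds."""

    def __init__(self, state_dim: int, action_dim: int, max_action: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # No clamp is needed: tanh bounds the output to (-1, 1),
        # and scaling by max_action maps it to (-max_action, max_action).
        return self.max_action * self.net(state)
```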

vwxyzjn commented 2 years ago

Thanks for raising this issue. I think you are referring to https://github.com/sfujim/TD3/blob/385b33ac7de4767bab17eb02ade4a268d3e4e24f/TD3.py#L28? Yeah this is a bit of a problem. That said, it's fine in MuJoCo because max_action is just 1.

huxiao09 commented 2 years ago

Yep. I would still highly suggest adopting Fujimoto's implementation, because in some MuJoCo environments (like Humanoid-v2) and some other environments the max_action may not be 1.

I would also highly suggest adding PyTorch model saving to each of your single-file implementations, and then writing a separate single file that loads the saved model and evaluates it in the corresponding environment.

Thanks for your reply and your nice cleanRL code : )

vwxyzjn commented 2 years ago

Thanks for the suggestion! Fixing the max_action issue with the actor makes sense to me.

Model saving, on the other hand, is a bit more nuanced. For example, ppo_continuous_action.py technically also needs to save the running mean and standard deviation stored in the environment wrappers. Because of these complications, we have decided against model saving to keep the core implementation as simple as possible. Additionally, most people save models only to load them later and see what the agent is actually doing. For that use case we already include videos of the agents playing the game in the docs (e.g., link), further eliminating the need to save models.
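
To illustrate the complication, here is a hypothetical sketch of what "saving the wrapper state too" would involve; the save_checkpoint helper and the obs_rms attribute layout are assumptions for illustration, not CleanRL code:

```python
import torch


def save_checkpoint(path, agent, obs_normalizer):
    """Hypothetical sketch: the policy weights alone are not enough for a
    ppo_continuous_action.py-style script; the running observation
    statistics kept inside the environment wrappers must be saved too."""
    torch.save(
        {
            "agent_state_dict": agent.state_dict(),
            # Assumed layout of a NormalizeObservation-style wrapper that
            # exposes its running mean/variance via an `obs_rms` attribute.
            "obs_rms_mean": obs_normalizer.obs_rms.mean,
            "obs_rms_var": obs_normalizer.obs_rms.var,
            "obs_rms_count": obs_normalizer.obs_rms.count,
        },
        path,
    )
```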

vwxyzjn commented 2 years ago

@dosssman This is indeed a much larger problem - I just tested it out and there are a number of environments that do not have 1 as the default max_action.

Testing environment: Ant-v2 Observation space: Box(-inf, inf, (111,), float64) Action space: Box(-1.0, 1.0, (8,), float32)
Testing environment: HalfCheetah-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
Testing environment: Hopper-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (3,), float32)
Testing environment: Humanoid-v2 Observation space: Box(-inf, inf, (376,), float64) Action space: Box(-0.4, 0.4, (17,), float32)
Testing environment: InvertedDoublePendulum-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (1,), float32)
Testing environment: InvertedPendulum-v2 Observation space: Box(-inf, inf, (4,), float64) Action space: Box(-3.0, 3.0, (1,), float32)
Testing environment: Pusher-v2 Observation space: Box(-inf, inf, (23,), float64) Action space: Box(-2.0, 2.0, (7,), float32)
Testing environment: Reacher-v2 Observation space: Box(-inf, inf, (11,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Swimmer-v2 Observation space: Box(-inf, inf, (8,), float64) Action space: Box(-1.0, 1.0, (2,), float32)
Testing environment: Walker2d-v2 Observation space: Box(-inf, inf, (17,), float64) Action space: Box(-1.0, 1.0, (6,), float32)
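
(A listing like the one above can be reproduced with a short script along these lines; a sketch, since the exact script used here is not shown in the thread.)

```python
import gym

ENV_IDS = [
    "Ant-v2", "HalfCheetah-v2", "Hopper-v2", "Humanoid-v2",
    "InvertedDoublePendulum-v2", "InvertedPendulum-v2",
    "Pusher-v2", "Reacher-v2", "Swimmer-v2", "Walker2d-v2",
]

# Print each environment's observation and action space so that
# non-unit action bounds (e.g. Humanoid-v2, Pusher-v2) stand out.
for env_id in ENV_IDS:
    env = gym.make(env_id)
    print(
        f"Testing environment: {env_id} "
        f"Observation space: {env.observation_space} "
        f"Action space: {env.action_space}"
    )
    env.close()
```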

@huxiao09 would you be interested in submitting a fix for this and going through the process of re-running benchmarks? Otherwise, no worries - we will take care of this.

huxiao09 commented 2 years ago

Sure, it's my pleasure. But I am busy this week. I'm afraid I can only try my best to finish it in two weeks. Is that too late?

dosssman commented 2 years ago

Thanks for the heads up. Will look into this.

vwxyzjn commented 2 years ago

Two weeks is actually a great timeline. Thank you @huxiao09 for your interest in working on this!

The process would go like this:

  1. file a PR to make the changes with the max_action
  2. we agree on the changes
  3. have you join our wandb team openrlbenchmark that tracks all of our benchmark experiments
  4. run the following benchmark scripts https://github.com/vwxyzjn/cleanrl/blob/6387191dbee74927b2872b2eb1759c72361d806f/benchmark/ddpg.sh#L1-L7 https://github.com/vwxyzjn/cleanrl/blob/6387191dbee74927b2872b2eb1759c72361d806f/benchmark/td3.sh#L1-L7

Btw, would this impact sac_continuous_action.py @dosssman? I see that in the SAC file there is no action clipping like there is in PPO:

https://github.com/vwxyzjn/cleanrl/blob/6387191dbee74927b2872b2eb1759c72361d806f/cleanrl/ppo_continuous_action.py#L89

dosssman commented 2 years ago

SAC should be unaffected, as the actions are scaled to the min/max action bounds before being returned to the algorithm.
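
For illustration, a minimal sketch of that kind of rescaling (names are illustrative, not the exact SAC code): the tanh-squashed output is mapped from (-1, 1) to the environment's bounds with a scale and bias derived from the action space.

```python
import torch
import torch.nn as nn


class SquashedActionHead(nn.Module):
    """Illustrative sketch: rescale a tanh-squashed output from (-1, 1)
    to the environment's [low, high] action range."""

    def __init__(self, action_low, action_high):
        super().__init__()
        low = torch.as_tensor(action_low, dtype=torch.float32)
        high = torch.as_tensor(action_high, dtype=torch.float32)
        # Buffers move with the module (e.g. .to(device)) but are not trained.
        self.register_buffer("action_scale", (high - low) / 2.0)
        self.register_buffer("action_bias", (high + low) / 2.0)

    def forward(self, pre_tanh: torch.Tensor) -> torch.Tensor:
        return torch.tanh(pre_tanh) * self.action_scale + self.action_bias
```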

huxiao09 commented 2 years ago

OK, got it :)

vwxyzjn commented 2 years ago

Thank you so much. Would you mind registering a wandb account and giving me your username? Would you also mind adding me on discord ( Costa#2021 )?

Thanks

huxiao09 commented 2 years ago

Yes, I already have an account: https://wandb.ai/huxiao . But it seems that Discord can't be opened in my region. I'll figure out how to solve this problem tomorrow.

dosssman commented 2 years ago

Preliminary fix tracked here for TD3 and here for DDPG.

The corresponding PR is #211

Overall, it does not seem to make that big of a difference compared to the previous version, but this version is indeed theoretically correct.
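
For context, the target-action computation after such a fix looks roughly like the sketch below (illustrative only; see PR #211 for the actual change). The key points are that the target actor already scales its tanh output to the action bounds and that the final clamp uses the environment's per-dimension low/high values rather than a hard-coded [-1, 1].

```python
import torch


def td3_target_actions(target_actor, next_observations, action_low, action_high,
                       policy_noise=0.2, noise_clip=0.5):
    """Illustrative sketch of TD3 target-policy smoothing when the target
    actor already scales its tanh output to the action bounds."""
    with torch.no_grad():
        raw = target_actor(next_observations)
        # Clipped Gaussian smoothing noise; it could additionally be scaled
        # by the action range for environments with non-unit bounds.
        noise = (torch.randn_like(raw) * policy_noise).clamp(-noise_clip, noise_clip)
        low = torch.as_tensor(action_low, dtype=raw.dtype, device=raw.device)
        high = torch.as_tensor(action_high, dtype=raw.dtype, device=raw.device)
        # Element-wise clamp against the per-dimension bounds.
        return torch.min(torch.max(raw + noise, low), high)
```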

vwxyzjn commented 2 years ago

I changed the run set name as follows (and customized the advanced legend to be ${runsetName}).

Could you do the following:

  • [ ] remove the new td3_continuous_action_bound_fix.py experiments with HalfCheetah-v2 (since its action space's bounds are -1, +1)
  • [ ] move the old td3_continuous_action.py experiments with Humanoid-v2, Pusher-v2, and InvertedPendulum-v2 to the openrlbenchmark/cleanrl-cache project
  • [ ] add the new td3_continuous_action_bound_fix.py experiments with Humanoid-v2 and Pusher-v2 to https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/MuJoCo-CleanRL-s-TD3--VmlldzoxNjk4Mzk5 (in particular, customize the run set name to be CleanRL's td3_continuous_action.py and set the advanced legend to be ${runsetName})
  • [ ] update the benchmark file https://github.com/vwxyzjn/cleanrl/blob/master/benchmark/td3.sh to include Humanoid-v2, Pusher-v2, and InvertedPendulum-v2
  • [ ] add a section in the documentation updating the charts for Humanoid-v2, Pusher-v2, and InvertedPendulum-v2 and explaining the use of action scale and bias

Thank you so much!

huxiao09 commented 2 years ago

OK. @dosssman has actually already done exactly what I was going to do. But I'm sorry you may not have seen my reply in time. [screenshot of the feedback pointing out two lines in the fix]

dosssman commented 2 years ago

Thank you very much for the feedback. Those two lines are quite a good catch. Will fix and add re-runs just to be sure.

vwxyzjn commented 2 years ago

Thank you Rousslan. Please remove the experiments or move them to the cleanrl-cache project - the stale experiments should not remain in openrlbenchmark/cleanrl.

dosssman commented 2 years ago

No worries. Will do. I'll tweak the last two lines @huxiao09 mentioned and re-run the experiments to make sure there are no side effects, then clean up the benchmark project structure and update the docs.

Should get it done by next Tuesday or Wednesday. Thanks.

dosssman commented 2 years ago

Let's just keep those old experiments to compare with the latest variant until then. Once the latest one is validated, I can clean up all the previous variants and re-run td3_continuous_action.py to prevent future confusion.

dosssman commented 2 years ago

Re-runs are looking good so far; not much difference from the previous iterations. ~It seems I will have to trouble you regarding the archival/removal of the old td3 and ddpg runs. Wandb reports missing permissions when I try to either move the runs to the cleanrl-cache project or just delete them, before queuing up clean td3_continuous_action.py and ddpg_continuous_action.py runs.~

@vwxyzjn I do need some help removing these runs, as I am not their author.

Thanks.