vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

PPO reward normalization works only for default gamma #203

Closed. Howuhh closed this issue 2 years ago.

Howuhh commented 2 years ago

Problem Description

The current implementation of continuous-action PPO uses gym.wrappers.NormalizeReward with the wrapper's default gamma. The wrapper scales rewards by a running estimate of the standard deviation of the discounted return it tracks internally, so for any gamma other than the default 0.99 the normalization uses the wrong discount and is therefore incorrect. https://github.com/vwxyzjn/cleanrl/blob/94a685de9290435623d7cf5e4e770418ddb10a4f/cleanrl/ppo_continuous_action.py#L92
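For context, here is a paraphrased sketch of what the wrapper does (not the exact gym source): it keeps an exponentially discounted running return using its own gamma and divides each reward by the running standard deviation of that return, which is why the wrapper's gamma should match the agent's discount factor.

```python
import numpy as np

# Paraphrased sketch of gym.wrappers.NormalizeReward's behavior, not the exact
# library code: the wrapper's own gamma drives the discounted-return estimate
# that is used to scale rewards.
class RewardNormalizerSketch:
    def __init__(self, gamma=0.99, epsilon=1e-8):
        self.gamma = gamma          # the wrapper's discount, independent of the agent's
        self.epsilon = epsilon
        self.discounted_return = 0.0
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4  # running statistics

    def __call__(self, reward):
        # Accumulate the discounted return with the *wrapper's* gamma.
        self.discounted_return = self.discounted_return * self.gamma + reward
        # Simplified Welford-style running-variance update.
        self.count += 1
        delta = self.discounted_return - self.mean
        self.mean += delta / self.count
        self.var += (delta * (self.discounted_return - self.mean) - self.var) / self.count
        # Scale the reward by the std of the discounted return.
        return reward / np.sqrt(self.var + self.epsilon)
```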

Possible Solution

The fix is easy: pass gamma=args.gamma to the normalization wrapper, as in the sketch below.
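A minimal sketch of the proposed change, assuming args.gamma is the discount factor already parsed from the command line; the surrounding wrapper calls approximate the env setup in ppo_continuous_action.py and may differ slightly from the file at that commit:

```python
import gym
import numpy as np

def make_env(env_id, gamma):
    # Build the environment with a wrapper stack similar to ppo_continuous_action.py,
    # but forward the agent's discount factor to the reward normalizer.
    env = gym.make(env_id)
    env = gym.wrappers.ClipAction(env)
    env = gym.wrappers.NormalizeObservation(env)
    env = gym.wrappers.TransformObservation(env, lambda obs: np.clip(obs, -10, 10))
    # Before: gym.wrappers.NormalizeReward(env)  # silently uses gamma=0.99
    env = gym.wrappers.NormalizeReward(env, gamma=gamma)
    env = gym.wrappers.TransformReward(env, lambda r: float(np.clip(r, -10, 10)))
    return env

# Any continuous-action env works for illustration; gamma here is a non-default value.
env = make_env("Pendulum-v1", gamma=0.999)
```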

Howuhh commented 2 years ago

If this really is a problem, I will open a PR with the fix.

vwxyzjn commented 2 years ago

Oh, this makes sense! Thanks for raising the issue. I think ppo_procgen.py and ppg_procgen.py also use reward normalization; feel free to submit a PR to fix them as well.

Howuhh commented 2 years ago

Fixed.