opendilab / DI-engine

OpenDILab Decision AI Engine. The Most Comprehensive Reinforcement Learning Framework
https://di-engine-docs.readthedocs.io
Apache License 2.0

why no tanh after sampling the action in PPO #716

Closed zhixiongzh closed 11 months ago

zhixiongzh commented 1 year ago

For MuJoCo tasks, the action space is in the range [-1, 1]. When I use SAC, the action is sampled from the predicted distribution and then a tanh function is applied after sampling, making sure the action is in the range [-1, 1]. But I do not see such an operation in PPO, i.e. the action produced by PPO and fed into the env can be out of range. Is there any explanation for the lack of tanh in PPO?
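For reference, this is roughly the tanh step I mean from SAC (a minimal PyTorch sketch with made-up policy outputs, not DI-engine's actual implementation):

    import torch
    from torch.distributions import Independent, Normal

    # illustrative stand-in for the policy network's Gaussian outputs
    mu, sigma = torch.zeros(3), torch.ones(3)
    dist = Independent(Normal(mu, sigma), 1)
    raw_action = dist.rsample()          # sample from the predicted distribution
    action = torch.tanh(raw_action)      # squash into [-1, 1] before stepping the env
    # the log-prob used by the SAC objective needs the tanh correction term
    log_prob = dist.log_prob(raw_action) - torch.log(1 - action.pow(2) + 1e-6).sum(-1)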

zhixiongzh commented 1 year ago

I checked the code in Stable Baselines3; they apply a clip to make sure the action is in range:

            # Rescale and perform action
            clipped_actions = actions
            # Clip the actions to avoid out of bound error
            if isinstance(self.action_space, spaces.Box):
                clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)

Is this a bug in DI-engine, or did I just miss it?

zhixiongzh commented 1 year ago

Maybe another question: if the low and high bounds of the env are defined as [-1, 1], how does DI-engine ensure the action fed into the env is in the correct range? I do not see any clip operation in the collector or env manager; did I miss it?

PaParaZz1 commented 1 year ago

> Maybe another question: if the low and high bounds of the env are defined as [-1, 1], how does DI-engine ensure the action fed into the env is in the correct range? I do not see any clip operation in the collector or env manager; did I miss it?

You can refer to this demo code to see how DI-engine handles this problem for PPO. Note that it is necessary to clip the action in the environment rather than in the policy.
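The idea is roughly the following (a hypothetical wrapper sketch for illustration, not the demo code itself):

    import numpy as np

    class ClippedActionEnv:
        """Hypothetical sketch: clip actions inside the environment, not the policy."""

        def __init__(self, env, low=-1.0, high=1.0):
            self._env = env
            self._low, self._high = low, high

        def step(self, action):
            # whatever the unbounded Gaussian PPO policy outputs is clipped here,
            # so the stored action stays consistent with the policy's distribution
            clipped = np.clip(action, self._low, self._high)
            return self._env.step(clipped)

Clipping in the environment keeps the collected (unclipped) action consistent with the policy's log-prob, which matters for the PPO importance ratio.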

zhixiongzh commented 1 year ago

@PaParaZz1 Thanks for the example, it is exactly what I was looking for. Based on the example you provided, I think there is a bug: not all environments in DI-Engine clip the action by default. For example, in MuJoCo, self._action_clip is False by default. I do not understand why False is the default, since clipping is a requirement the action has to satisfy rather than a choice. I am actually training PPO in a MuJoCo environment, and the config file DI-engine provides does not set self._action_clip, so training with that config leaves things in a strange state: the action is not clipped to the correct range even though the final reward can look good. I think this should not be False by default.

https://github.com/opendilab/DI-engine/blob/6e93b4ca840b4eba424a5ae5ed4a55240173a394/dizoo/mujoco/envs/mujoco_env.py#L108
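For now I can work around it by enabling the flag explicitly in the config, something like this (the key name is my guess from the attribute name, not copied from the provided config):

    # hypothetical config fragment: explicitly enable action clipping for the MuJoCo env
    mujoco_ppo_config = dict(
        env=dict(
            env_id='Hopper-v3',
            action_clip=True,  # ensure out-of-range PPO actions are clipped before env.step
        ),
    )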