**Closed** · zhixiongzh closed this issue 11 months ago
I checked the code in Stable-Baselines3; they apply a clip to make sure the action is in range:
```python
# Rescale and perform action
clipped_actions = actions
# Clip the actions to avoid out-of-bound error
if isinstance(self.action_space, spaces.Box):
    clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)
```
Is this a bug in DI-engine, or did I just miss something?
Another question: if the low and high bounds of the env are defined to be `[-1, 1]`, how does DI-engine ensure the action fed into the env is in the correct range? I do not see any clip operation in the collector or the env manager; did I miss it?
> Another question: if the low and high bounds of the env are defined to be `[-1, 1]`, how does DI-engine ensure the action fed into the env is in the correct range? I do not see any clip operation in the collector or the env manager; did I miss it?
You can refer to this demo code to see how DI-engine deals with this problem for PPO. Note that it is necessary to clip the action in the environment rather than in the policy.
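To illustrate the env-side approach, here is a minimal sketch of a wrapper that clips before the raw environment ever sees the action. This is a hypothetical illustration of the pattern, not DI-engine's actual implementation; the class name and `[-1, 1]` bounds are assumptions.

```python
import numpy as np


class ClipActionWrapper:
    """Hypothetical sketch: clip actions at the env boundary, so the
    policy is free to output unbounded values during exploration."""

    def __init__(self, env, low=-1.0, high=1.0):
        self._env = env
        self._low = low
        self._high = high

    def step(self, action):
        # Clip here, inside the env wrapper, rather than in the policy:
        # the policy's training signal still sees its raw output, while
        # the underlying env only ever receives in-range actions.
        clipped = np.clip(action, self._low, self._high)
        return self._env.step(clipped)
```

Clipping at this layer keeps every policy (PPO, SAC, ...) safe regardless of whether it squashes its own outputs.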
@PaParaZz1
Thanks for the example, it is exactly what I was looking for. Based on the example you provided, I think there is a bug: not all environments in DI-engine clip the action by default. For example, in MuJoCo, `self._action_clip` is `False` by default. I do not understand why `False` is the default, since keeping the action in bounds is a requirement the action has to satisfy, not a choice. I am actually training PPO in the MuJoCo environment, and the config file DI-engine provides does not set `self._action_clip`, so using that config leaves training in a strange state: the action is not clipped into the correct region, even though the final reward can look good. I think this should not default to `False`.
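Until the default changes, one workaround is to enable clipping explicitly in the env config. The sketch below is an assumption: the `action_clip` key name is inferred from the `self._action_clip` attribute discussed above and may differ in the actual DI-engine config schema.

```python
# Hypothetical config fragment -- the `action_clip` key is an assumption
# inferred from the `self._action_clip` attribute, not verified API.
mujoco_ppo_config = dict(
    env=dict(
        env_id='Hopper-v3',
        action_clip=True,  # force actions into the env's Box bounds
    ),
)

print(mujoco_ppo_config['env']['action_clip'])  # True
```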
Hi, for MuJoCo games the action space is in the range `[-1, 1]`. When I use `sac`, the action is sampled from the predicted distribution and then a `tanh` function is applied after sampling, making sure the action is in the range `[-1, 1]`. But I do not see such an operation in `ppo`, i.e. the action from `ppo` that is fed into the `env` can be out of range. Is there any explanation for the lack of `tanh` in `ppo`?
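For reference, the SAC-style squashing described above can be sketched as follows. This is a minimal illustration of the technique, not DI-engine's actual policy code:

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_squashed_action(mu, sigma):
    """Sample from a Gaussian, then squash with tanh so the result
    always lands in (-1, 1) -- the trick SAC uses. A plain Gaussian
    PPO policy skips this step, which is why its raw samples can fall
    outside the Box bounds unless something clips them."""
    raw = rng.normal(mu, sigma)  # unbounded Gaussian sample
    return np.tanh(raw)          # squashed into (-1, 1)


# Even with a large sigma, every squashed sample stays in bounds.
for _ in range(5):
    a = sample_squashed_action(0.0, 2.0)
    assert -1.0 < a < 1.0
```

Note that squashing changes the log-probability of the action (SAC applies a tanh correction term), which is one reason a plain Gaussian PPO implementation may prefer env-side clipping instead.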