Closed nic-barbara closed 5 months ago
Hi @nic-barbara
It looks like the pendulum and reacher envs are affected by this bug, where we don't scale the actions. Feel free to send over a PR that scales the action. Here's a reference for how that would be implemented:
This was also implemented here, but it's unused:
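The linked reference is not reproduced in this thread, so here is a minimal sketch of the kind of scaling being discussed: an affine map from the policy's tanh range [-1, 1] onto an actuator range [low, high]. The name scale_action and its arguments are hypothetical, and plain Python is used rather than jax.numpy to keep it self-contained. It is not necessarily how brax implements it.

```python
def scale_action(action, low, high):
    """Affinely map each action in [-1, 1] onto its actuator range [low, high].

    Hypothetical helper for illustration only, not taken from brax.
    """
    return [lo + 0.5 * (a + 1.0) * (hi - lo)
            for a, lo, hi in zip(action, low, high)]


# The inverted_pendulum range [-3, 3] mentioned in this issue.
print(scale_action([-1.0, 0.0, 1.0], [-3.0] * 3, [3.0] * 3))
# prints [-3.0, 0.0, 3.0]
```

With a map like this, -1 and 1 land exactly on the actuator limits and 0 lands on the midpoint of the range, so the tanh output keeps its full resolution.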
Thanks @btaba, I'll take a look! Should we do the same for the humanoid and humanoidstandup environments too? The humanoid is restricted to [-0.4,0.4] on all control inputs, which means the policy output will just saturate rather than smoothly hitting the [-1,1] boundaries of tanh. This might make training more difficult?
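To make the saturation concern concrete, here is a small sketch comparing a tanh output that is hard-clipped at ±0.4 with one that is scaled into [-0.4, 0.4]. The function names and the sample pre-activation values are assumptions chosen for illustration, and the clip is a stand-in for whatever the simulator actually does with out-of-range controls.

```python
import math

CTRL_LIMIT = 0.4  # humanoid control limit quoted in this thread


def clipped_control(pre_activation):
    """tanh output followed by a hard clip to [-CTRL_LIMIT, CTRL_LIMIT]."""
    a = math.tanh(pre_activation)
    return max(-CTRL_LIMIT, min(CTRL_LIMIT, a))


def scaled_control(pre_activation):
    """tanh output rescaled so that [-1, 1] maps onto [-CTRL_LIMIT, CTRL_LIMIT]."""
    return CTRL_LIMIT * math.tanh(pre_activation)


# Beyond this pre-activation (about 0.42) the clipped control stops changing.
saturation_point = math.atanh(CTRL_LIMIT)

for x in (0.2, 0.5, 1.0, 2.0):
    print(f"x={x:3.1f}  clipped={clipped_control(x):.3f}  "
          f"scaled={scaled_control(x):.3f}")
```

In the clipped version, every pre-activation above the saturation point yields the same control of 0.4, so the clip passes no gradient there. The scaled version keeps increasing smoothly toward 0.4.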
AFAIU we were working off of humanoid-v4, which is in [-1, 1]. I would look at the docstrings in brax. It looks like Farama deleted the docstrings for their older versions...
In practice, I tested that training curves and behaviors for all environments looked good at the time these environments were implemented. I compared training curves and behaviors in video to an older version of brax, across all physics backends. It'd be awesome if you could do a similar exercise for the environments you edit, to show that policies are at least as good as the base version.
If I have time I'll do the same, thanks for the suggestion. Unfortunately I don't have a huge amount of compute power so it might have to wait a while.
You're right that the humanoid says it uses [-1,1] in the docstring, but the actual humanoid.xml file still seems to limit the control inputs with ctrlrange="-.4 .4":
Interesting, that's probably why they changed it in v5 :). In this case, the simulator is clipping the actions, and that hasn't been an obvious issue for training humanoid. But it'd be good to ablate if you find the time!
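One way to check which actuator limits a model file declares is to read the ctrlrange attributes directly. The sketch below uses Python's standard xml.etree.ElementTree on a hypothetical XML fragment. The fragment, the actuator and joint names, and the control_ranges helper are all made up for illustration and are not the contents of the real humanoid.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical MuJoCo-style fragment, NOT the real humanoid.xml.
XML = """
<mujoco>
  <actuator>
    <motor name="example_a" joint="joint_a" gear="100" ctrllimited="true" ctrlrange="-.4 .4"/>
    <motor name="example_b" joint="joint_b" gear="100" ctrllimited="true" ctrlrange="-1 1"/>
  </actuator>
</mujoco>
"""


def control_ranges(xml_text):
    """Return {actuator name: (low, high)} for motors with an explicit ctrlrange."""
    root = ET.fromstring(xml_text)
    ranges = {}
    for motor in root.iter("motor"):
        ctrlrange = motor.get("ctrlrange")
        if ctrlrange is not None:
            low, high = (float(v) for v in ctrlrange.split())
            ranges[motor.get("name")] = (low, high)
    return ranges


print(control_ranges(XML))
```

This only picks up ranges written directly on motor elements. A model that sets ctrlrange through default classes or uses other actuator types would need more handling, so treat it as a quick inspection aid rather than a full parser.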
@btaba I just submitted https://github.com/google/brax/pull/473, let me know what you think.
Networks used as control policies in BRAX seem to have a tanh layer on the output to constrain actions to [-1,1]. However, many of the environments in BRAX have action spaces with a range greater than [-1, 1]. For example, the inverted_pendulum environment accepts actions in the range [-3,3].
Is there somewhere that scales the policy output to the actuator ranges for a given environment? Or are all control policies in BRAX currently restricted to actions in [-1,1]?
Thanks in advance for any advice/help!