google / brax

Massively parallel rigidbody physics simulation on accelerator hardware.
Apache License 2.0

Are control actions scaled in BRAX environments? #472

Closed: nic-barbara closed this 5 months ago

nic-barbara commented 7 months ago

Networks used as control policies in BRAX seem to have a tanh layer on the output to constrain actions to [-1, 1]. However, many of the environments in BRAX have action spaces wider than [-1, 1]. For example, the inverted_pendulum environment accepts actions in the range [-3, 3].

Is there somewhere that scales the policy output to the actuator ranges for a given environment? Or are all control policies in BRAX currently restricted to actions in [-1,1]?

Thanks in advance for any advice/help!

btaba commented 7 months ago

Hi @nic-barbara

It looks like the pendulum and reacher envs are affected by this bug: we don't scale the actions. Feel free to send over a PR that scales the actions. Here's a reference for how that could be implemented:

https://github.com/Farama-Foundation/Gymnasium/blob/373ccf0e005efc2835fa25a56aa4058960de711f/gymnasium/envs/mujoco/mujoco_env.py#L97-L101

This was also implemented here, but it's unused:

https://github.com/google/brax/blob/2329ae76759e37b0b1f1861cf34e5a67d0f7efa8/brax/envs/wrappers/gym.py#L50-L51
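A minimal sketch of the kind of affine rescaling being suggested, assuming the policy emits actions in [-1, 1] (the tanh range) and each actuator has a (low, high) control range such as the [-3, 3] of inverted_pendulum. The helper name `scale_action` is hypothetical and not part of the Brax API. It uses plain Python floats so it runs without dependencies; the same arithmetic applies elementwise to `jax.numpy` arrays.

```python
def scale_action(action, ctrl_range):
    """Map each action component from [-1, 1] onto its actuator's [low, high].

    action: list of floats in [-1, 1], e.g. the output of a tanh policy head.
    ctrl_range: list of (low, high) pairs, one per actuator (hypothetical layout).
    """
    scaled = []
    for a, (low, high) in zip(action, ctrl_range):
        # -1 maps to low, 0 to the midpoint, +1 to high.
        scaled.append(low + (a + 1.0) * 0.5 * (high - low))
    return scaled


# An inverted_pendulum-style range of [-3, 3], applied to three sample actions.
print(scale_action([-1.0, 0.0, 1.0], [(-3.0, 3.0)] * 3))  # → [-3.0, 0.0, 3.0]
```

For a symmetric range (low = -high) this reduces to `a * high`, so an environment whose range is already [-1, 1] would see no change.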

nic-barbara commented 7 months ago

Thanks @btaba, I'll take a look! Should we do the same for the humanoid and humanoidstandup environments too? The humanoid is restricted to [-0.4, 0.4] on all control inputs, which means the policy output will just saturate at ±0.4 rather than smoothly approaching the [-1, 1] bounds of tanh. Could this make training more difficult?
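To make the saturation concern concrete, here is a minimal sketch comparing clipping with rescaling, assuming out-of-range controls are simply clipped to the humanoid's [-0.4, 0.4] range. The helpers `clip` and `rescale` are hypothetical illustrations, not Brax functions.

```python
import math

LOW, HIGH = -0.4, 0.4  # humanoid ctrlrange discussed in this thread


def clip(u, low=LOW, high=HIGH):
    # Saturate an out-of-range control, as a simulator with ctrl limits would.
    return max(low, min(high, u))


def rescale(a, low=LOW, high=HIGH):
    # Affine map from the tanh range [-1, 1] onto [low, high].
    return low + (a + 1.0) * 0.5 * (high - low)


for pre_activation in (-2.0, -0.5, 0.0, 0.5, 2.0):
    a = math.tanh(pre_activation)  # policy output in (-1, 1)
    print(f"tanh={a:+.3f}  clipped={clip(a):+.3f}  rescaled={rescale(a):+.3f}")
```

Under clipping, pre-activations of 0.5 and 2.0 both produce the same +0.400 control: anything above atanh(0.4) ≈ 0.42 is indistinguishable, and the gradient through the clip is zero there. The rescaled version stays strictly increasing over the whole tanh range.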

btaba commented 7 months ago

AFAIU we were working off humanoid-v4, whose action space is [-1, 1]. I would look at the docstrings in brax. It looks like Farama deleted the docstrings for their older versions...

https://github.com/google/brax/blob/2329ae76759e37b0b1f1861cf34e5a67d0f7efa8/brax/envs/humanoid.py#L48

In practice, I tested that training curves and behaviors for all environments looked good (at the time these environments were implemented). I compared training curves and behaviors in video to an older version of brax, across all physics backends. It'd be awesome if you could do a similar exercise for the environments you edit, to show that policies are at least as good as in the base version.

nic-barbara commented 7 months ago

If I have time I'll do the same, thanks for the suggestion. Unfortunately I don't have a huge amount of compute power so it might have to wait a while.

You're right that the humanoid says it uses [-1,1] in the docstring, but the actual humanoid.xml file still seems to limit the control inputs with ctrlrange="-.4 .4":

https://github.com/google/brax/blob/2329ae76759e37b0b1f1861cf34e5a67d0f7efa8/brax/envs/assets/humanoid.xml#L6
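One way to check which limits actually apply is to read `ctrlrange` from the model XML rather than relying on a docstring. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`. The MJCF snippet is a simplified, hypothetical stand-in modeled on a `<default>` motor entry like the `ctrlrange="-.4 .4"` above, not a copy of Brax's humanoid.xml. The helper name `actuator_ctrl_ranges` is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed MJCF: a default motor sets ctrlrange for every
# actuator, and one actuator overrides it explicitly.
MJCF = """
<mujoco model="toy_humanoid">
  <default>
    <motor ctrlrange="-.4 .4" ctrllimited="true"/>
  </default>
  <actuator>
    <motor name="abdomen_y" joint="abdomen_y" gear="100"/>
    <motor name="hip_x" joint="hip_x" gear="100"/>
    <motor name="elbow" joint="elbow" gear="25" ctrlrange="-1 1"/>
  </actuator>
</mujoco>
"""


def actuator_ctrl_ranges(xml_text):
    """Return {actuator name: (low, high)}, falling back to the top-level
    <default><motor ctrlrange=...> when an actuator sets none itself.
    Simplification: ignores nested default classes and other actuator types."""
    root = ET.fromstring(xml_text.strip())
    default_motor = root.find("default/motor")
    default_range = default_motor.get("ctrlrange") if default_motor is not None else None
    ranges = {}
    for motor in root.iter("motor"):
        name = motor.get("name")
        if name is None:
            continue  # skip the unnamed <default> template itself
        text = motor.get("ctrlrange", default_range)
        if text is None:
            continue  # no range specified for this actuator
        low, high = (float(x) for x in text.split())
        ranges[name] = (low, high)
    return ranges


print(actuator_ctrl_ranges(MJCF))
```

This sketch deliberately avoids Brax's own system APIs, so it only reflects what the XML declares, not how a given physics backend enforces it.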

btaba commented 7 months ago

Interesting, that's probably why they changed it in v5 :). In this case, the simulator is clipping the actions, and that hasn't been an obvious issue for training humanoid. But it'd be good to ablate if you find the time!

nic-barbara commented 7 months ago

@btaba I just submitted https://github.com/google/brax/pull/473, let me know what you think.