google / brax

Massively parallel rigidbody physics simulation on accelerator hardware.
Apache License 2.0

Question About Ant Behavior #272

Closed: SumeetBatra closed this issue 1 year ago

SumeetBatra commented 1 year ago

Hi folks, I have a question about how the ant environment behaves initially. I'm training RL policies with 2-layer MLPs using PPO and noticed that the initial rewards become quite negative before the policy begins to learn. I understand that this could be due to any number of differences in my PPO implementation, hyperparameters, model architecture, etc. However, when I visualize just a randomly initialized policy, I see that the ant sometimes flips over and accumulates large negative rewards until timeout termination. Here's a screenshot that shows what's happening.

[Screenshot: ant_flipped_over]

Is this correct behavior on the environment side? I would have expected some termination condition to trigger when the ant flips over like that, so that the episode doesn't continue until timeout termination. Or maybe I'm missing something.

btaba commented 1 year ago

Hi @SumeetBatra, the environment should terminate when terminate_when_unhealthy is True (the default), see https://github.com/google/brax/blob/main/brax/envs/ant.py#L241. However, if you're using something like the AutoResetWrapper, the environment will be reset automatically. Perhaps your random initialization is flipping the Ant, and the env keeps getting reset to the unhealthy state?
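The reset-to-unhealthy hypothesis can be illustrated with a toy model. This is a plain-Python sketch, not Brax code. The state is reduced to a single torso height, and the health check is assumed to compare that height against a band. `HEALTHY_Z_RANGE`, `ToyState`, `step`, `auto_reset_step` and `count_terminations` are all hypothetical names. The auto-reset step only mimics the behavior described above: whenever `done` is set, the stored initial state is swapped back in.

```python
from dataclasses import dataclass, replace

# Hypothetical healthy band for torso height (illustrative values only).
HEALTHY_Z_RANGE = (0.2, 1.0)


@dataclass(frozen=True)
class ToyState:
    torso_z: float  # torso height above the floor
    done: bool      # True on the step where the episode terminated


def is_healthy(torso_z, z_range=HEALTHY_Z_RANGE):
    """Healthy iff the torso height lies inside the closed band."""
    min_z, max_z = z_range
    return min_z <= torso_z <= max_z


def step(state, dz, terminate_when_unhealthy=True):
    """Toy dynamics: shift the torso height by dz, then check health."""
    z = state.torso_z + dz
    done = terminate_when_unhealthy and not is_healthy(z)
    return ToyState(torso_z=z, done=done)


def auto_reset_step(state, init, dz, terminate_when_unhealthy=True):
    """Step; if the episode ended, swap in the stored initial state.

    `done` stays True on the returned state so the learner still sees the
    termination, but the next call continues from `init`.
    """
    nxt = step(state, dz, terminate_when_unhealthy)
    return replace(init, done=True) if nxt.done else nxt


def count_terminations(init, n_steps, dz=0.0, terminate_when_unhealthy=True):
    """Run n_steps with auto-reset and count the steps that terminated."""
    state, terminations = init, 0
    for _ in range(n_steps):
        state = auto_reset_step(state, init, dz, terminate_when_unhealthy)
        terminations += int(state.done)
    return terminations


healthy_init = ToyState(torso_z=0.5, done=False)
unhealthy_init = ToyState(torso_z=0.1, done=False)  # e.g. a bad initialization
print(count_terminations(healthy_init, 100))    # 0
print(count_terminations(unhealthy_init, 100))  # 100
print(count_terminations(unhealthy_init, 100, terminate_when_unhealthy=False))  # 0
```

When the initial state is itself unhealthy, every step terminates and is immediately reset to that same state. A visualized rollout can then look like one long episode stuck in the bad pose rather than a series of terminations.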