Closed SumeetBatra closed 1 year ago
Hi @SumeetBatra, the environment should terminate once the Ant becomes unhealthy, as long as terminate_when_unhealthy is True (the default); see https://github.com/google/brax/blob/main/brax/envs/ant.py#L241. However, if you're using something like the AutoResetWrapper, the environment will be reset automatically. Perhaps your random initialization is flipping the Ant, and the env keeps getting reset to the unhealthy state?
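To make the termination condition concrete, here is a minimal pure-Python sketch of a height-based health check. It is not the Brax code itself: the function name `ant_done`, the height-only criterion, and the `(0.2, 1.0)` range are assumptions modeled on Gym-style Ant environments, so please check them against the linked file. The torso heights passed in are made-up values for illustration.

```python
def ant_done(torso_z, healthy_z_range=(0.2, 1.0), terminate_when_unhealthy=True):
    """Return 1.0 if the episode should end, else 0.0.

    Assumption: "healthy" only means the torso height lies inside
    healthy_z_range; the torso orientation is not inspected.
    """
    min_z, max_z = healthy_z_range
    is_healthy = 1.0 if min_z <= torso_z <= max_z else 0.0
    # With termination disabled, the episode only ends on timeout.
    return 1.0 - is_healthy if terminate_when_unhealthy else 0.0


# Hypothetical torso heights, chosen only to exercise each branch.
print(ant_done(0.55))  # upright, inside the range
print(ant_done(0.10))  # collapsed, below the range
print(ant_done(0.30))  # e.g. flipped but still inside the range
print(ant_done(0.10, terminate_when_unhealthy=False))
```

If the real check is height-only like this sketch, a flipped Ant whose torso stays inside the range would still count as healthy, which would be a second possible explanation (besides auto-reset) for episodes running until timeout.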
Hi folks, I have a question about how the ant environment behaves initially. I'm training RL policies with 2-layer MLPs using PPO and noticed that the initial rewards become quite negative before the policy begins to learn. I understand that this could be due to any number of differences in my PPO implementation, hyperparameters, model architecture, etc. However, when I visualize just a randomly initialized policy, I see that the ant sometimes flips over and accumulates large negative rewards until timeout termination. Here's a screenshot that visualizes what's happening.
Is this the correct behavior on the environment side? I would have thought there would be some termination condition for when the ant flips over like that, so that the episode doesn't continue until timeout termination. Or maybe I'm missing something.