Question About Ant Behavior

Hi folks, I have a question about how the Ant environment behaves early in training. I'm training RL policies (2-layer MLPs) with PPO and noticed that the rewards become quite negative before the policy begins to learn. I understand that this could be due to any number of differences in my PPO implementation, hyperparameters, model architecture, etc. However, when I visualize a randomly initialized policy on its own, I see that the ant sometimes flips over and then accumulates large negative rewards until the episode ends at the timeout. Here's a screenshot of what's happening.

[Screenshot: ant_flipped_over]

Is this correct behavior on the environment side? I would have thought there would be some termination condition if the ant flips over like that so that this doesn't continue until timeout termination. Or maybe I'm missing something.
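
To make concrete what I mean by a flip-over termination condition, here is a minimal plain-Python sketch. It is not Brax code and I'm not claiming the environment does anything like this: the function names `torso_up_z` and `is_flipped`, the `(w, x, y, z)` quaternion ordering, and the threshold of `0.0` are all my own assumptions for illustration.

```python
import math


def torso_up_z(quat):
    """World-frame z component of the torso's local up axis.

    Assumes `quat` is the torso orientation as (w, x, y, z); this ordering
    is an assumption for the sketch, not something taken from Brax.
    """
    w, x, y, z = quat
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("quaternion must be non-zero")
    # Normalize so the rotation-matrix identity below holds.
    x, y = x / norm, y / norm
    # Entry R[2][2] of the rotation matrix for a unit quaternion.
    return 1.0 - 2.0 * (x * x + y * y)


def is_flipped(quat, threshold=0.0):
    """True when the torso's up axis points below `threshold` in world z.

    `threshold` is a hypothetical tuning knob: 0.0 means "tilted past
    90 degrees"; a value closer to -1.0 would only catch a full flip.
    """
    return torso_up_z(quat) < threshold


# Upright torso (identity rotation) vs. torso rotated 180 degrees about x.
upright = (1.0, 0.0, 0.0, 0.0)
upside_down = (0.0, 1.0, 0.0, 0.0)
print(is_flipped(upright), is_flipped(upside_down))
```

If something like `is_flipped` were OR-ed into the episode's done flag, a flipped ant would end the episode immediately instead of collecting negative reward until the timeout, which is the behavior I was expecting.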

google / brax

Question About Ant Behavior #272