openai / mujoco-py

MuJoCo is a physics engine for detailed, efficient rigid body simulations with contacts. mujoco-py allows using MuJoCo from Python 3.

Mujoco Ant-v2 didn't restart the env when ant is flip over #599

Open johnnylin110 opened 3 years ago

johnnylin110 commented 3 years ago

I am using the MuJoCo Ant-v2 environment with my DDPG model, but my average reward only reaches about 300-400. To see more detail I watched env.render, and I noticed that when the ant flips over, the done returned by env.step(action) is still False. The ant therefore keeps going until it hits the max episode length of 1000 steps (with env.render it is 5000) before restarting, and it collects the survive reward at every step. Whenever my reward is high (about 600~700), the ant is flipped over and not moving forward, so it looks like my model learned that flipping over gives the best reward.

Is this a common situation? And can someone tell me when the Ant's done is set to True? The original Ant code looks like this:

state = self.state_vector()
notdone = np.isfinite(state).all()  and state[2] >= 0.2 and state[2] <= 1.0
done = not notdone

But I can't figure out when the Ant's done actually gets set to True. Thanks!

johnnylin110 commented 3 years ago

Update: I also checked the paper "Benchmarking Deep Reinforcement Learning for Continuous Control", which describes the Ant environment with this condition: (image) where zbody is the z-coordinate of the body. This corresponds to the code in Ant-v2: (image) where the episode continues while state[2] >= 0.2 and state[2] <= 1. But in the Ant-v2 I ran, the episode just continues when the Ant flips over. Is it possible for the Ant to flip over without violating those conditions?

dkkim93 commented 3 years ago

Thanks for the interesting question! When I printed the z-coordinate of the body while the Ant robot was flipped over, the values were 0.259, 0.261, 0.273, 0.302, etc. So I wonder whether it would be correct to modify the code so that the environment resets when the z-coordinate of the body stays below roughly 0.3, since that means the robot is flipped over.
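The modification proposed above could be sketched as a thin wrapper rather than a change to the environment source. This is only a sketch under assumptions: the class name `FlipTermination`, the `z_fn` callback, and the `patience` counter are hypothetical names introduced here, and the 0.3 threshold is the rough value from the comment above, not a tested constant. It assumes the older Gym step API that returns `(obs, reward, done, info)`, as used in this thread.

```python
class FlipTermination:
    """Hypothetical wrapper: end the episode once the body height has stayed
    below z_min for `patience` consecutive steps."""

    def __init__(self, env, z_fn, z_min=0.3, patience=10):
        self.env = env
        self.z_fn = z_fn          # callable returning the body z-coordinate
        self.z_min = z_min        # illustrative threshold, needs tuning
        self.patience = patience  # how many low steps count as "flipped"
        self._low_steps = 0

    def reset(self, **kwargs):
        self._low_steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Count consecutive steps with the body below the threshold.
        if self.z_fn(self.env) < self.z_min:
            self._low_steps += 1
        else:
            self._low_steps = 0
        if self._low_steps >= self.patience:
            done = True
            info = dict(info, flipped=True)
        return obs, reward, done, info
```

For Ant-v2, `z_fn` might be something like `lambda env: env.unwrapped.sim.data.qpos[2]`, on the assumption that `state[2]` in the quoted snippet is the third entry of `qpos`. Requiring several consecutive low steps avoids ending an episode on a single dip, and note that one of the values listed above (0.302) sits just above 0.3, so the threshold would need some experimentation.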

johnnylin110 commented 3 years ago

@dkkim93 Thanks for your reply. I observed the same thing, and I wonder whether this is just a local optimum for this case. The Ant can flip over or stay still and collect the survive reward (=1) at each step. So if the Ant flips over or stays still with small joint controls (the reward also has a term that penalizes joint control), it can get almost 900 reward (a local optimum). However, the best reward still comes from moving forward as fast as it can. After tuning the parameters of my DDPG, I got rewards around 1200 in some episodes, and the ant moved forward quickly.
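The arithmetic behind that "almost 900" figure can be sketched as follows. The survive reward of 1 per step and the 1000-step episode come from the thread; the average per-step control penalty of 0.1 is purely an illustrative assumption chosen to land near 900, and the forward-progress and contact terms are ignored. The function name is hypothetical.

```python
def idle_episode_return(steps=1000, survive_reward=1.0, avg_ctrl_penalty=0.1):
    """Return of an ant that never moves forward and never terminates:
    survive bonus collected every step minus an assumed control penalty."""
    return steps * survive_reward - steps * avg_ctrl_penalty


# With the illustrative defaults this comes out at roughly 900,
# below the ~1200 reported for episodes where the ant actually runs.
print(idle_episode_return())
```

If the episode ended as soon as the ant flipped, `steps` would be much smaller and this local optimum would be far less attractive.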

aowen87 commented 2 years ago

I've been noticing the same behavior. It's kind of unclear if this is intentional or not...

CUN-bjy commented 1 year ago

I had the same issue. So, we need to adjust healthy_z_range for our desired task.

I hope this GIF helps you guys! It gives the same intuition as the comment above.

ezgif com-gif-maker (1)
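To make the healthy_z_range suggestion concrete, here is a minimal sketch that re-implements the check quoted at the top of the thread with the range as a parameter, and runs it on the flipped-over z values reported earlier. It is a standalone illustration, not the environment's actual code: `is_healthy` is written here with the standard library only, and the (0.3, 1.0) range is just one candidate.

```python
import math


def is_healthy(state, healthy_z_range=(0.2, 1.0)):
    """Mirror of the quoted check: all entries finite and z within the range."""
    min_z, max_z = healthy_z_range
    return all(math.isfinite(x) for x in state) and min_z <= state[2] <= max_z


# z values printed while the ant was flipped over (from the comment above).
flipped_z = [0.259, 0.261, 0.273, 0.302]

for z in flipped_z:
    state = [0.0, 0.0, z]  # only index 2 matters for this check
    done_default = not is_healthy(state)             # range (0.2, 1.0)
    done_raised = not is_healthy(state, (0.3, 1.0))  # raised lower bound
    print(z, done_default, done_raised)
```

With the default range none of these values ends the episode, which matches what was observed. As far as I know, Gym's Ant-v3 accepts this range as a constructor argument, so something like `gym.make("Ant-v3", healthy_z_range=(0.3, 1.0))` may be enough, but that should be checked against the installed Gym version.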