google-deepmind / mujoco

Multi-Joint dynamics with Contact. A general purpose physics simulator.
https://mujoco.org
Apache License 2.0

"Vibrations" around the goal #1754

Open DanieleLiuni opened 1 week ago

DanieleLiuni commented 1 week ago

Hi, I'm a mechanical engineering student and I'm trying to use MuJoCo for reinforcement learning. As a first attempt, I created a simple environment with a spherical body moved by a mocap body and a target site that the sphere has to reach. To make the sphere follow the mocap exactly, I created a weld constraint with solimp='0.998 0.999 0.0001 0.1 6' and solref='0.0015 0.7' (close to a hard constraint). The RL reward is the negative of the distance between the sphere and the target, and the action is continuous between -0.04 and 0.04 (action = mocap delta position).
Applying RL, I observe that the sphere reaches the target, but then it starts moving back and forth around the target at the maximum action. Is this a problem related to how the mocap works?
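
For reference, here is a minimal sketch of the kind of setup I mean (the body/site names, sizes and positions are only illustrative, not my actual model; the solimp/solref values are the ones above):

```python
import numpy as np
import mujoco

# Illustrative MJCF: a mocap body welded to a free-floating sphere, plus a target site.
XML = """
<mujoco>
  <worldbody>
    <body name="mocap_anchor" mocap="true" pos="0 0 0.5">
      <geom type="sphere" size="0.01" contype="0" conaffinity="0" rgba="1 0 0 0.3"/>
    </body>
    <body name="sphere" pos="0 0 0.5">
      <freejoint/>
      <geom type="sphere" size="0.03" mass="0.1"/>
    </body>
    <site name="target" pos="0.3 0.3 0.5" size="0.02" rgba="0 1 0 0.5"/>
  </worldbody>
  <equality>
    <weld body1="mocap_anchor" body2="sphere"
          solimp="0.998 0.999 0.0001 0.1 6" solref="0.0015 0.7"/>
  </equality>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# One environment step: the action is a mocap delta position, clipped to +/- 0.04.
action = np.array([0.04, 0.0, 0.0])
data.mocap_pos[0] += np.clip(action, -0.04, 0.04)
mujoco.mj_step(model, data)
```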

yuvaltassa commented 1 week ago

Sounds like maybe this is just the noise injected into the controller during RL? That's how RL works, after all...

DanieleLiuni commented 1 week ago

Sorry, I probably didn't explain my point clearly. I can understand the movement of the sphere around the goal during learning, since the agent has to explore the environment. But I can't understand why I still have these "vibrations" when I run the trained agent (trained for 2 million timesteps). In theory, the agent should learn that, once the target is reached, the null action is the best one. A similar situation is visible in the "Reach" Fetch environment provided by Gymnasium Robotics: the robot doesn't stop after the target is reached. Since I want to study oscillations of DLOs (deformable linear objects), I need to remove this effect.

Balint-H commented 1 week ago

Most RL frameworks let you switch to a deterministic policy (taking the action with maximum probability) once learning is finished, since, depending on the configuration, the agent may never converge to a fully zero-variance policy on its own. Could you check whether such an option is available to you (e.g. something like the sketch below)?
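
For example, with Stable-Baselines3 an evaluation rollout would look roughly like this (just a sketch, not your actual code; MyReachEnv and "ppo_reach" are placeholder names):

```python
from stable_baselines3 import PPO

# Placeholder names: MyReachEnv is your custom Gymnasium env,
# "ppo_reach" is the path of the saved policy.
env = MyReachEnv()
model = PPO.load("ppo_reach", env=env)

obs, info = env.reset()
for _ in range(1000):
    # deterministic=True uses the mean of the Gaussian policy instead of
    # sampling from it, so the exploration noise disappears at evaluation time.
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
```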

Also, it's worth considering the magnitude of your mocap delta position. If it's too large, the action will overshoot the target, forcing the agent to adjust again. Lastly, what are the observations?

DanieleLiuni commented 1 week ago

I'm using PPO from Stable-Baselines3, so I can switch to a deterministic policy when the policy predicts the action. However, the change has no effect (same problem, with the body moving back and forth at the maximum possible action). The action space is self.action_space = spaces.Box(-0.04, 0.04, shape=(self.n_actions,), dtype="float32"), so the agent can choose an action between -0.04 and 0.04; the distance from the target is 0.45 and the distance threshold is 0.03. I don't think the delta position is too large, since it's one order of magnitude smaller than the distance. Finally, the observation space is self.observation_space = spaces.Dict(dict(desired_goal=spaces.Box(-np.inf, np.inf, shape=obs["achieved_goal"].shape, dtype="float64"), achieved_goal=spaces.Box(-np.inf, np.inf, shape=obs["achieved_goal"].shape, dtype="float64"), observation=spaces.Box(-np.inf, np.inf, shape=obs["observation"].shape, dtype="float64"), contact=spaces.Discrete(2))).

Inside "observation" I put the position and the velocity of the body, in "desired_goal" the target pos and in "achieved_goal" the position of the body again.

(Initially, I thought the problem was caused by a difference in the motion of the mocap and the body. Using solimp='0.998 0.999 0.0001 0.1 6' and solref='0.0015 0.7' I created a hard constraint between the mocap and the body to solve this issue.)