google-deepmind / mujoco_mpc

Real-time behaviour synthesis with MuJoCo, using Predictive Control
https://github.com/deepmind/mujoco_mpc
Apache License 2.0

Policy update after finding best rollout #97

Closed: nrkumar93 closed this issue 1 year ago

nrkumar93 commented 1 year ago

In the iLQG planner, the nominal policy is updated with candidate_policy[winner], where winner is the index of the best rollout. However, only candidate_policy[0] has its trajectory assigned from trajectory[winner]; candidate_policy[winner] does not.

Is this correct or is there a minor bug?

Thank you so much for releasing this wonderful package. I am learning so many implementation tricks to get optimal control methods working reliably.

thowell commented 1 year ago

Great question!

This is not a bug. In cases where the function iLQGPlanner::Iteration is called multiple times in a row, we need to update candidate_policy[0] with the latest trajectory[winner] before the function is called again. We could copy the entire policy (i.e., candidate_policy[0] = candidate_policy[winner]), but this is a more expensive operation. The line efficiently copies the necessary information.
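
To make the cost difference concrete, here is a minimal sketch with hypothetical simplified types (these are not the actual mjpc classes, just an illustration of the idea): an iLQG policy carries a nominal trajectory plus time-varying feedback terms, so copying the whole policy moves considerably more data than refreshing only the trajectory.

```cpp
#include <vector>

// Hypothetical simplified types, sketched only to show relative copy cost.
struct Trajectory {
  std::vector<double> states;   // horizon x state dimension, flattened
  std::vector<double> actions;  // horizon x action dimension, flattened
};

struct Policy {
  Trajectory trajectory;                    // nominal trajectory
  std::vector<double> feedback_gains;       // horizon x (nu x nx), flattened
  std::vector<double> action_improvements;  // horizon x nu, flattened
};

// Cheap update: refresh only the nominal trajectory in candidate_policy[0].
void UpdateNominalTrajectory(Policy& candidate0, const Trajectory& winner) {
  candidate0.trajectory = winner;  // copies states/actions only
}

// Expensive alternative: copy the entire policy, including feedback terms.
void CopyFullPolicy(Policy& candidate0, const Policy& winner_policy) {
  candidate0 = winner_policy;  // also copies gains and improvements
}
```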

The way we are implementing MPC is to call iLQGPlanner::Iteration once for the current state. However, in cases where you want to improve your solution further (i.e., multiple iterations of improvement for the current state), this copy functionality is useful.
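
For illustration, here is a hypothetical driver sketch (not the actual mjpc agent loop) contrasting the two usage patterns: one Iteration call per control step in real-time MPC versus repeated calls that keep refining the plan for a single state, which is the case where the nominal trajectory must be refreshed between calls.

```cpp
#include <vector>

// Hypothetical stand-ins for the planner and state, for illustration only.
struct State {
  std::vector<double> qpos;
  std::vector<double> qvel;
  double time = 0.0;
};

class PlannerLike {
 public:
  void SetState(const State& /*state*/) { /* store the planning state */ }
  void Iteration() { /* rollouts + backward pass + winner selection */ }
};

// Real-time MPC: a single improvement step per control step, then act.
void MpcLoop(PlannerLike& planner, const std::vector<State>& stream) {
  for (const State& s : stream) {
    planner.SetState(s);
    planner.Iteration();
  }
}

// Offline refinement: several iterations for the same state; here the nominal
// (candidate_policy[0] in mjpc) must carry the latest winning trajectory
// between successive Iteration calls.
void Refine(PlannerLike& planner, const State& s, int num_iterations) {
  planner.SetState(s);
  for (int i = 0; i < num_iterations; ++i) {
    planner.Iteration();
  }
}
```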

nrkumar93 commented 1 year ago

Thanks for the quick response; I think my original post was not very clear.

Here is the linearized code flow relevant to my question. During every iteration:

/// in iLQGPlanner::NominalTrajectory() 
candidate_policy[0:num_trajectory] = policy; /// line 169, copy nominal policy into every candidate policy
candidate_policy[0].trajectory = trajectory[best_nominal]; /// line 191, copy the lowest cost trajectory to candidate_policy[0].trajectory

/// in iLQGPlanner::Iteration() (after backpass)
candidate_policy[0:num_trajectory] = candidate_policy[0]; /// line 518, copy candidate_policy[0] into every candidate policy
candidate_policy[0].trajectory = trajectory[winner]; /// line 536, copy the lowest cost trajectory to candidate_policy[0]
policy = candidate_policy[winner]; /// line 578, copy the candidate_policy[winner] to the nominal policy

Following your logic, shouldn't the last line (line 578) be policy = candidate_policy[0], since only candidate_policy[0] contains trajectory[winner], not candidate_policy[winner]?

Once again, this is nothing major, just a minor observation/question. Thanks!

thowell commented 1 year ago

The updates/iterations here are a bit subtle. It would be helpful to provide a detailed algorithm for our implementation of iLQG. I will add this to my todo list.

Hopefully this provides clarification.