google-deepmind / mujoco_mpc

Real-time behaviour synthesis with MuJoCo, using Predictive Control
https://github.com/deepmind/mujoco_mpc
Apache License 2.0

Policy update after finding best rollout #97

Closed: nrkumar93 closed this issue 1 year ago

nrkumar93 commented 1 year ago

In the iLQG planner, the nominal policy is updated with candidate_policy[winner], where winner is the index of the best rollout. However, only candidate_policy[0] has its trajectory assigned from trajectory[winner]; candidate_policy[winner] does not.

Is this correct or is there a minor bug?

Thank you so much for releasing this wonderful package. I am learning so many implementation tricks to get optimal control methods working reliably.

thowell commented 1 year ago

Great question!

This is not a bug. In cases where the function iLQGPlanner::Iteration is called multiple times in a row, we need to update candidate_policy[0] with the latest trajectory[winner] before the function is called again. We could copy the entire policy (i.e., candidate_policy[0] = candidate_policy[winner]), but this is a more expensive operation. The line efficiently copies the necessary information.
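
To make the cost difference concrete, here is a minimal sketch with hypothetical simplified types (these are not the actual mjpc classes, just an illustration of the idea): an iLQG policy carries a nominal trajectory plus time-varying feedback terms, so copying the whole policy moves considerably more data than refreshing only the trajectory.

```cpp
#include <vector>

// Hypothetical simplified types, sketched only to show relative copy cost.
struct Trajectory {
  std::vector<double> states;   // horizon x state dimension, flattened
  std::vector<double> actions;  // horizon x action dimension, flattened
};

struct Policy {
  Trajectory trajectory;                    // nominal trajectory
  std::vector<double> feedback_gains;       // horizon x (nu x nx), flattened
  std::vector<double> action_improvements;  // horizon x nu, flattened
};

// Cheap update: refresh only the nominal trajectory in candidate_policy[0].
void UpdateNominalTrajectory(Policy& candidate0, const Trajectory& winner) {
  candidate0.trajectory = winner;  // copies states/actions only
}

// Expensive alternative: copy the entire policy, including feedback terms.
void CopyFullPolicy(Policy& candidate0, const Policy& winner_policy) {
  candidate0 = winner_policy;  // also copies gains and improvements
}
```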

The way we are implementing MPC is to call iLQGPlanner::Iteration once for the current state. However, in cases where you want to improve your solution further (i.e., multiple iterations of improvement for the current state), this copy functionality is useful.
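
For illustration, here is a hypothetical driver sketch (not the actual mjpc agent loop) contrasting the two usage patterns: one Iteration call per control step in real-time MPC versus repeated calls that keep refining the plan for a single state, which is the case where the nominal trajectory must be refreshed between calls.

```cpp
#include <vector>

// Hypothetical stand-ins for the planner and state, for illustration only.
struct State {
  std::vector<double> qpos;
  std::vector<double> qvel;
  double time = 0.0;
};

class PlannerLike {
 public:
  void SetState(const State& /*state*/) { /* store the planning state */ }
  void Iteration() { /* rollouts + backward pass + winner selection */ }
};

// Real-time MPC: a single improvement step per control step, then act.
void MpcLoop(PlannerLike& planner, const std::vector<State>& stream) {
  for (const State& s : stream) {
    planner.SetState(s);
    planner.Iteration();
  }
}

// Offline refinement: several iterations for the same state; here the nominal
// (candidate_policy[0] in mjpc) must carry the latest winning trajectory
// between successive Iteration calls.
void Refine(PlannerLike& planner, const State& s, int num_iterations) {
  planner.SetState(s);
  for (int i = 0; i < num_iterations; ++i) {
    planner.Iteration();
  }
}
```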

nrkumar93 commented 1 year ago

Thanks for the quick response; I think my original post was not very clear.

Here is the linearized code flow relevant to my question. During every iteration:

/// in iLQGPlanner::NominalTrajectory() 
candidate_policy[0:num_trajectory] = policy; /// line 169, copy nominal policy into every candidate policy
candidate_policy[0].trajectory = trajectory[best_nominal]; /// line 191, copy the lowest cost trajectory to candidate_policy[0].trajectory

/// in iLQGPlanner::Iteration() (after backpass)
candidate_policy[0:num_trajectory] = candidate_policy[0]; /// line 518, copy candidate_policy[0] into every candidate policy
candidate_policy[0].trajectory = trajectory[winner]; /// line 536, copy the lowest cost trajectory to candidate_policy[0]
policy = candidate_policy[winner]; /// line 578, copy the candidate_policy[winner] to the nominal policy

Following your logic, shouldn't the last line (line 578) be policy = candidate_policy[0], since only candidate_policy[0] contains trajectory[winner], not candidate_policy[winner]?

Once again, this is nothing major, just a minor observation/question. Thanks!

thowell commented 1 year ago

The updates/iterations here are a bit subtle. It would be helpful to provide a detailed algorithm for our implementation of iLQG. I will add this to my todo list.

Hopefully this provides clarification.