google-deepmind / mujoco_mpc

Real-time behaviour synthesis with MuJoCo, using Predictive Control
https://github.com/deepmind/mujoco_mpc
Apache License 2.0

Sample gradient planner #260

Closed thowell closed 5 months ago

thowell commented 6 months ago

Experimental sampling-based planner:

First, sample $2 n_p$ policies:

$$ \begin{align} \theta^{(i)} &= \theta + s e_i, \quad i = 1, \dots, n_p \\ \theta^{(i + n_p)} &= \theta - s e_i, \quad i = 1, \dots, n_p \end{align} $$

where $\theta \in \mathbf{R}^{n_p}$ is the nominal set of parameters, $s \in \mathbf{R}_{++}$ is a scale controlling the spread of the samples, and $e_i \in \mathbf{R}^{n_p}$ is the $i$-th basis vector.
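A minimal NumPy sketch of this sampling step, assuming the policy parameters are stored as a flat vector (the function name and array layout are illustrative, not taken from the planner's C++ implementation):

```python
import numpy as np

def sample_perturbed_policies(theta, s):
    """Return the 2 * n_p perturbed parameter vectors theta +/- s * e_i."""
    n_p = theta.shape[0]
    eye = np.eye(n_p)
    plus = theta + s * eye    # row i: theta + s * e_i
    minus = theta - s * eye   # row i: theta - s * e_i
    return np.vstack([plus, minus])  # shape (2 * n_p, n_p)
```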

Second, construct approximate gradient and (diagonal) Hessian using finite differencing:

$$ \begin{align} g^{(i)} &= \frac{R(\theta^{(i)}) - R(\theta^{(i + n_p)})}{2 s} \\ H^{(i, i)} &= \frac{R(\theta^{(i)}) - 2 R(\theta) + R(\theta^{(i + n_p)})}{s^2} \end{align} $$

where $g \in \mathbf{R}^{n_p}$, $H \in \mathbf{S}_{++}^{n_p}$, and $R : \mathbf{R}^{n_p} \rightarrow \mathbf{R}$ is the return of the policy.
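A sketch of this finite-differencing step under the same assumptions, with `R` standing in for a rollout that evaluates the policy's return:

```python
import numpy as np

def finite_difference_gradient_hessian(R, theta, s):
    """Central-difference gradient and diagonal Hessian of the return R."""
    n_p = theta.shape[0]
    g = np.zeros(n_p)
    h = np.zeros(n_p)          # diagonal of H
    r0 = R(theta)              # return at the nominal parameters
    for i in range(n_p):
        e = np.zeros(n_p)
        e[i] = 1.0
        r_plus = R(theta + s * e)
        r_minus = R(theta - s * e)
        g[i] = (r_plus - r_minus) / (2.0 * s)
        h[i] = (r_plus - 2.0 * r0 + r_minus) / (s ** 2)
    return g, h
```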

The Cauchy point:

$$\theta^{(\text{C})} = \theta - \frac{g^T g}{g^T H g} \cdot g$$

and approximate Newton point ($H$ is diagonal):

$$ \theta^{(\text{N})} = \theta - H^{-1} g$$

are computed.
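Given the gradient and the Hessian diagonal, both candidate points can be computed as below; the `eps` clamp that guards against nonpositive curvature is a simplification of my own, not something the issue specifies:

```python
import numpy as np

def cauchy_and_newton_points(theta, g, h, eps=1e-8):
    """Cauchy point (optimal steepest-descent step under the quadratic model)
    and approximate Newton point for a diagonal Hessian with diagonal h."""
    gHg = np.dot(g, h * g)                         # g^T H g
    alpha = np.dot(g, g) / max(gHg, eps)           # Cauchy step length
    theta_cauchy = theta - alpha * g
    theta_newton = theta - g / np.maximum(h, eps)  # diagonal H^{-1} g, elementwise
    return theta_cauchy, theta_newton
```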

These points are used to construct another set of policies:

$$\phi^{(t)} = (1 - t) \cdot \theta^{(\text{C})} + t \cdot \theta^{(\text{N})}, \quad t \in [0, 1]$$

Finally, the new policy is the best among the nominal policy $\theta$, the perturbed policies $\theta^{(i)}$, and the line-search policies $\phi^{(t)}$:

$$ \theta = \text{argmin} \, \{R(\theta), R(\theta^{(1)}), \dots, R(\theta^{(2 n_p)}), R(\phi^{(0)}), \dots, R(\phi^{(1)})\}$$
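A sketch of this final selection, assuming the line search evaluates a fixed grid of $t$ values in $[0, 1]$ (the grid size here is an assumption; the issue does not state how many line-search policies are evaluated):

```python
import numpy as np

def select_best_policy(R, theta, perturbed, theta_cauchy, theta_newton, num_line=10):
    """Pick the lowest-return candidate among the nominal, perturbed, and
    line-search policies between the Cauchy and Newton points."""
    ts = np.linspace(0.0, 1.0, num_line)
    line = [(1.0 - t) * theta_cauchy + t * theta_newton for t in ts]
    candidates = [theta] + list(perturbed) + line
    returns = [R(c) for c in candidates]
    return candidates[int(np.argmin(returns))]
```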

Importantly, in the case where perturbed parameters violate bounds, alternative perturbations and finite-differencing schemes are employed.
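The issue does not spell out which alternative schemes are used; one plausible example, shown purely as an illustration, is falling back to a one-sided difference for a coordinate whose centered perturbation would leave the bounds:

```python
import numpy as np

def one_sided_gradient(R, theta, s, lower, upper, i):
    """Coordinate-i derivative of R, switching from a centered to a one-sided
    difference when a perturbation would violate the bounds [lower, upper]."""
    e = np.zeros(theta.shape[0])
    e[i] = 1.0
    if lower[i] <= theta[i] - s and theta[i] + s <= upper[i]:
        return (R(theta + s * e) - R(theta - s * e)) / (2.0 * s)  # centered
    if theta[i] + s <= upper[i]:
        return (R(theta + s * e) - R(theta)) / s                  # forward
    return (R(theta) - R(theta - s * e)) / s                      # backward
```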

These ideas are borrowed from the dogleg (trust-region) method.

thowell commented 6 months ago

It's more challenging to change the winner color because the Agent performs this visualization and the Planner has no access to it. For now, I have modified the plot to step between three levels to indicate the winner: perturbation (-6), nominal (0), or gradient (6).

(screenshot: winner-indicator plot)