google-deepmind / mujoco_mpc

Real-time behaviour synthesis with MuJoCo, using Predictive Control
https://github.com/deepmind/mujoco_mpc
Apache License 2.0

Sample gradient planner (v2) #273

Closed thowell closed 5 months ago

thowell commented 6 months ago

Another experimental sampling-based planner:

First, sample $N$ policies:

$$\theta_k^{(i)} = \theta_k + s\, d_k^{(i)}, \quad d_k^{(i)} \sim \mathcal{N}(0, I)$$

where $\theta \in \mathbf{R}^{n}$ are the policy parameters, $s \in \mathbf{R}_{++}$ is a scaling factor controlling the spread of samples, and $d \in \mathbf{R}^{n}$ is a sample from a zero-mean Gaussian.
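A minimal numpy sketch of this sampling step (not code from the repository); the dimension `n`, sample count `N`, and scale `s` below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

n, N = 12, 8         # parameter dimension and number of samples (illustrative)
s = 0.1              # scaling factor controlling the spread of samples
theta = np.zeros(n)  # nominal policy parameters theta_k

# d_k^(i) ~ N(0, I): one standard-normal direction per sample
d = rng.standard_normal((N, n))

# theta_k^(i) = theta_k + s * d_k^(i)
theta_samples = theta + s * d
```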

Second, construct an approximate gradient:

$$g_k = \frac{1}{s \cdot N} \sum_{i=1}^{N} \bar{R}\big(\theta_k^{(i)}\big) \cdot d_k^{(i)}$$

where $g \in \mathbf{R}^{n}$ and $R : \mathbf{R}^{n} \rightarrow \mathbf{R}$ is the return of the policy; the overbar ($\bar{\cdot}$) denotes a shaped return (e.g., rank-preserving fitness shaping or subtracting the mean return).
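A sketch of the gradient estimate, assuming the returns of the sampled policies have already been evaluated and using mean subtraction as the shaping (rank-based fitness shaping is the other option mentioned above); the example returns here are random placeholders, not rollout results:

```python
import numpy as np

def approximate_gradient(returns, d, s):
    """g_k = 1/(s*N) * sum_i Rbar(theta_k^(i)) * d_k^(i).

    returns: (N,) raw returns R(theta_k^(i))
    d:       (N, n) sampled directions d_k^(i)
    s:       noise scale used when sampling
    """
    # Shaped return: subtract the mean return.
    shaped = returns - returns.mean()
    N = len(returns)
    return (shaped[:, None] * d).sum(axis=0) / (s * N)

# Example with placeholder returns/directions
rng = np.random.default_rng(0)
N, n, s = 8, 12, 0.1
returns = rng.standard_normal(N)   # stand-in for R(theta_k^(i))
d = rng.standard_normal((N, n))
g = approximate_gradient(returns, d, s)
```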

This approximate gradient is used to construct another set of $M$ policies along the gradient direction:

$$\phi_k^{(j)} = \theta_k - \alpha \big(t \cdot g_k + (1 - t) \cdot g_{k-1} \big), \quad \alpha \in [0, 2], \quad t \in [0, 1]$$

The parameter $t$ is selected to retain information from the previous gradient, providing a "filter".
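A sketch of the gradient-direction candidates. It assumes (my reading of the "linesearch" idea, not stated explicitly above) that each of the $M$ candidates uses its own step size $\alpha^{(j)}$ spread over $[0, 2]$; the filter weight `t` blends the current and previous gradients:

```python
import numpy as np

def gradient_candidates(theta, g, g_prev, M, t=0.5, alpha_max=2.0):
    """phi_k^(j) = theta_k - alpha^(j) * (t * g_k + (1 - t) * g_{k-1})."""
    # Filtered gradient: retain information from the previous gradient.
    g_filtered = t * g + (1.0 - t) * g_prev
    # Linesearch-style step sizes spread over (0, alpha_max].
    alphas = np.linspace(0.0, alpha_max, M + 1)[1:]
    return theta - alphas[:, None] * g_filtered

# Example
n, M = 12, 6
theta = np.zeros(n)
g = np.ones(n)
g_prev = np.zeros(n)
phi = gradient_candidates(theta, g, g_prev, M)  # shape (M, n)
```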

Finally, the new policy is the best among the nominal policy $\theta_k$, the random policies $\theta_k^{(i)}$, and the previous gradient-direction policies $\phi_{k-1}^{(j)}$:

$$\theta_{k+1} = \underset{\pi \,\in\, \{\theta_k,\ \theta_k^{(1)}, \dots, \theta_k^{(N)},\ \phi_{k-1}^{(1)}, \dots, \phi_{k-1}^{(M)}\}}{\text{argmin}} \ R(\pi)$$

Note: we compare the previous set of gradient-direction samples (i.e., $k-1$) against the current set of sampled parameters (i.e., $k$) since these evaluations can be performed in parallel.
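A sketch of the selection step, assuming lower return is better (matching the argmin above); `returns_fn` is a hypothetical placeholder for rollout evaluation, and the candidate rollouts could be evaluated in parallel as noted:

```python
import numpy as np

def select_best(theta, theta_samples, phi_prev, returns_fn):
    """theta_{k+1}: the candidate with the lowest return among
    theta_k, theta_k^(i), and phi_{k-1}^(j)."""
    # Stack nominal, sampled, and previous gradient-direction candidates.
    candidates = np.vstack([theta[None, :], theta_samples, phi_prev])
    returns = np.array([returns_fn(c) for c in candidates])
    return candidates[np.argmin(returns)]

# Example with a placeholder quadratic "return"
rng = np.random.default_rng(0)
n, N, M = 12, 8, 6
theta = np.zeros(n)
theta_samples = theta + 0.1 * rng.standard_normal((N, n))
phi_prev = theta - np.linspace(0.1, 2.0, M)[:, None] * rng.standard_normal(n)
theta_next = select_best(theta, theta_samples, phi_prev,
                         returns_fn=lambda p: float(np.sum(p ** 2)))
```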

Idea: simple evolutionary strategy + gradient descent w/ linesearch.