Another experimental sampling-based planner:
First, sample $N$ policies:
$$ \begin{align} \theta_k^{(i)} &= \theta_k + s \, d_k^{(i)}, \quad d_k^{(i)} \sim \mathcal{N}(0, I) \end{align} $$
where $\theta \in \mathbf{R}^{n}$ are parameters, $s \in \mathbf{R}_{++}$ is a scaling factor controlling the spread of samples, and $d \in \mathbf{R}^{n}$ is sampled from a zero-mean Gaussian.
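A minimal sketch of this sampling step in NumPy (the function name and signature are my own, not from the source):

```python
import numpy as np

def sample_policies(theta_k, s, N, rng):
    """Draw N perturbed parameter vectors around theta_k.

    theta_k : current parameters, shape (n,)
    s       : scaling factor controlling the spread of samples
    Returns the perturbed parameters theta_k^{(i)} and the raw
    Gaussian directions d_k^{(i)}.
    """
    d = rng.standard_normal((N, theta_k.shape[0]))  # d_k^{(i)} ~ N(0, I)
    return theta_k + s * d, d
```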
Second, construct an approximate gradient:
$$ \begin{align} g_k &= \frac{1}{s \cdot N} \sum_{i = 1}^{N} \bar{R}(\theta_k^{(i)}) \cdot d_k^{(i)} \end{align} $$
where $g \in \mathbf{R}^{n}$, and $R : \mathbf{R}^{n} \rightarrow \mathbf{R}$ is the return of the policy; the overbar ($\bar{\cdot}$) denotes a shaped return (e.g., rank-preserving fitness shaping or subtracting the mean return).
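The gradient estimate can be sketched as follows, using mean subtraction as the shaped return $\bar{R}$ (one of the shaping options mentioned above; the function name is hypothetical):

```python
import numpy as np

def estimate_gradient(returns, d, s):
    """Approximate gradient g_k from sampled returns.

    returns : R(theta_k^{(i)}) for each sample, shape (N,)
    d       : Gaussian directions d_k^{(i)}, shape (N, n)
    s       : scaling factor used when sampling
    """
    shaped = returns - returns.mean()   # shaped return: subtract mean
    N = returns.shape[0]
    # (1 / (s N)) * sum_i shaped_i * d_i
    return (shaped @ d) / (s * N)
```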
These points are used to construct another set of $M$ policies along the gradient direction:
$$\phi_k^{(j)} = \theta_k - \alpha \big(t \cdot g_k + (1 - t) \cdot g_{k-1} \big), \quad \alpha \in [0, 2], \quad t \in [0, 1]$$
The parameter $t$ blends the current and previous gradient estimates, acting as a "filter" that retains information from the previous gradient.
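Sketching the construction of the $M$ gradient-direction candidates; the grids over $\alpha$ and $t$ are assumptions here, since the text does not specify how they are chosen:

```python
import numpy as np

def gradient_candidates(theta_k, g_k, g_prev, alphas, ts):
    """Build M = len(alphas) * len(ts) candidates phi_k^{(j)} along
    the filtered gradient direction.

    alphas : hypothetical grid of step sizes in [0, 2]
    ts     : hypothetical grid of filter weights in [0, 1]
    """
    candidates = []
    for t in ts:
        g = t * g_k + (1.0 - t) * g_prev  # blended ("filtered") gradient
        for alpha in alphas:
            candidates.append(theta_k - alpha * g)
    return np.array(candidates)
```

Scanning a grid of $\alpha$ values amounts to a crude parallel line search along the filtered gradient, which matches the summary at the end of this section.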
Finally, the new policy is the best among the random policies $\theta_k^{(i)}$ and previous gradient-direction policies $\phi_{k-1}^{(j)}$:
$$ \theta_{k+1} = \underset{\theta \, \in \, \{\theta_k, \, \theta_k^{(1)}, \dots, \theta_k^{(N)}, \, \phi_{k-1}^{(1)}, \dots, \phi_{k-1}^{(M)}\}}{\text{argmin}} \; R(\theta)$$
Note: we compare the previous set of gradient-direction samples (i.e., $k-1$) to the current set of sample parameters (i.e., $k$) since these evaluations can be performed in parallel.
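The selection step can be sketched as below; following the argmin in the update above, $R$ is treated as a cost to be minimized (flip to `argmax` if $R$ is a reward to be maximized):

```python
import numpy as np

def select_next(R, theta_k, thetas, phis_prev):
    """Pick theta_{k+1} as the best of: the current policy, the
    current random samples, and the previous gradient-direction
    candidates (evaluated in parallel in practice).

    R         : callable mapping parameters (shape (n,)) to a scalar
    theta_k   : current parameters, shape (n,)
    thetas    : random samples theta_k^{(i)}, shape (N, n)
    phis_prev : gradient candidates phi_{k-1}^{(j)}, shape (M, n)
    """
    candidates = np.vstack([theta_k[None, :], thetas, phis_prev])
    values = np.array([R(c) for c in candidates])
    return candidates[np.argmin(values)]
```

Because the current iterate $\theta_k$ is always in the candidate set, the selected value of $R$ never gets worse from one iteration to the next.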
Idea: simple evolutionary strategy + gradient descent w/ linesearch.