Closed thowell closed 5 months ago
It's more challenging to change the winner color because Agent performs this visualization and Planner has no access. For now, I have modified the plot to step between three levels to indicate the winner: perturb (-6), nominal (0), or gradient (6).
Experimental sampling-based planner:
First, sample $2 n_p$ policies:
$$ \begin{align} \theta^{(i)} &= \theta + s e_i, \quad i = 1, \dots, n_p \\ \theta^{(i + n_p)} &= \theta - s e_i, \quad i = 1, \dots, n_p \end{align} $$
where $\theta \in \mathbf{R}^{n_p}$ is the nominal set of parameters, $s \in \mathbf{R}_{++}$ is a scaling parameter that controls the spread of the samples, and $e_i \in \mathbf{R}^{n_p}$ is the $i$-th standard basis vector.
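The sampling step above can be sketched in plain Python (the function name and list-based parameter representation are illustrative, not from the actual implementation):

```python
def sample_policies(theta, s):
    """Generate 2*n_p perturbed parameter vectors along coordinate directions.

    theta: nominal parameters (list of floats), s > 0: sample spread.
    Returns samples[0:n_p] = theta + s*e_i and samples[n_p:2*n_p] = theta - s*e_i.
    """
    n_p = len(theta)
    samples = []
    for i in range(n_p):  # positive perturbations: theta + s * e_i
        plus = list(theta)
        plus[i] += s
        samples.append(plus)
    for i in range(n_p):  # negative perturbations: theta - s * e_i
        minus = list(theta)
        minus[i] -= s
        samples.append(minus)
    return samples
```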
Second, construct an approximate gradient and (diagonal) Hessian using finite differencing:
$$ \begin{align} g^{(i)} &= \frac{R(\theta^{(i)}) - R(\theta^{(i + n_p)})}{2 s} \\ H^{(i, i)} &= \frac{R(\theta^{(i)}) - 2 R(\theta) + R(\theta^{(i + n_p)})}{s^2} \end{align} $$
where $g \in \mathbf{R}^{n_p}$, $H \in \mathbf{S}_{++}^{n_p}$, and $R : \mathbf{R}^{n_p} \rightarrow \mathbf{R}$ is the return of the policy.
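A minimal sketch of the central-difference estimates, assuming `samples` is ordered with the `+s*e_i` perturbations first and the `-s*e_i` perturbations second (names are illustrative):

```python
def finite_difference(R, theta, samples, s):
    """Central-difference gradient and diagonal Hessian estimates.

    R: scalar objective, theta: nominal parameters,
    samples: first n_p entries are theta + s*e_i, last n_p are theta - s*e_i.
    """
    n_p = len(theta)
    R0 = R(theta)
    g = [0.0] * n_p
    H = [0.0] * n_p  # diagonal entries of the Hessian estimate
    for i in range(n_p):
        R_plus = R(samples[i])
        R_minus = R(samples[i + n_p])
        g[i] = (R_plus - R_minus) / (2.0 * s)            # central difference
        H[i] = (R_plus - 2.0 * R0 + R_minus) / (s ** 2)  # second difference
    return g, H
```

For a quadratic objective these estimates are exact, which makes a handy sanity check.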
The Cauchy point:
$$\theta^{(\text{C})} = \theta - \frac{g^T g}{g^T H g} \cdot g$$
and approximate Newton point ($H$ is diagonal):
$$ \theta^{(\text{N})} = \theta - H^{-1} g$$
are computed.
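Both points are cheap to compute when $H$ is diagonal; a sketch under the same list-based representation (function name assumed):

```python
def cauchy_and_newton(theta, g, H):
    """Cauchy point (minimizer of the quadratic model along -g) and
    Newton point, exploiting the diagonal Hessian approximation H."""
    gTg = sum(gi * gi for gi in g)
    gTHg = sum(gi * gi * Hi for gi, Hi in zip(g, H))
    alpha = gTg / gTHg  # Cauchy step length g^T g / (g^T H g)
    cauchy = [ti - alpha * gi for ti, gi in zip(theta, g)]
    # H^{-1} g is an elementwise division for a diagonal H
    newton = [ti - gi / Hi for ti, gi, Hi in zip(theta, g, H)]
    return cauchy, newton
```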
These points are used to construct another set of policies:
$$\phi^{(t)} = (1 - t) \cdot \theta^{(\text{C})} + t \cdot \theta^{(\text{N})}, \quad t \in [0, 1]$$
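In practice $t$ is evaluated on a discrete grid over $[0, 1]$; a sketch (the grid size and function name are assumptions):

```python
def line_search_policies(cauchy, newton, num=5):
    """Policies interpolated between the Cauchy and Newton points:
    phi(t) = (1 - t) * cauchy + t * newton, for t on a uniform grid in [0, 1]."""
    ts = [j / (num - 1) for j in range(num)]
    return [[(1 - t) * c + t * n for c, n in zip(cauchy, newton)] for t in ts]
```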
Finally, the new policy is the best among the nominal policy $\theta$, the perturbed policies $\theta^{(i)}$, and the line-search policies $\phi^{(t)}$:
$$ \theta = \text{argmin} \, \{ R(\theta), R(\theta^{(1)}), \dots, R(\theta^{(2 n_p)}), R(\phi^{(t_1)}), \dots, R(\phi^{(t_m)}) \} $$
where $t_1, \dots, t_m$ is a discretization of $[0, 1]$.
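The final selection step reduces to taking the candidate with the lowest objective value (sketch; names assumed):

```python
def select_policy(R, theta, samples, line_policies):
    """Return the candidate with the lowest value of R among the nominal,
    perturbed, and line-search policies."""
    candidates = [theta] + samples + line_policies
    return min(candidates, key=R)
```

Note that the nominal $\theta$ is included among the candidates, so the update never makes the objective worse.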
Importantly, when perturbed parameters would violate bounds, alternative perturbations and finite-differencing schemes are employed.
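One possible fallback scheme (this is an illustrative sketch, not necessarily the scheme used here): replace the central difference on a bound-violating coordinate with a one-sided forward or backward difference that stays inside the box.

```python
def one_sided_gradient(R, theta, i, s, lower, upper):
    """Gradient estimate for coordinate i when a +/- s perturbation would
    leave the box [lower, upper]: fall back to a one-sided difference."""
    R0 = R(theta)
    if theta[i] + s <= upper[i]:
        # forward difference: (R(theta + s*e_i) - R(theta)) / s
        plus = list(theta)
        plus[i] += s
        return (R(plus) - R0) / s
    elif theta[i] - s >= lower[i]:
        # backward difference: (R(theta) - R(theta - s*e_i)) / s
        minus = list(theta)
        minus[i] -= s
        return (R0 - R(minus)) / s
    else:
        raise ValueError("spread s too large for the bounds on coordinate i")
```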
These ideas are borrowed from the dogleg trust-region method.