sql-hkr closed 1 year ago
The `wandb.log()` call was also partially revised:
wandb.log(
    {
        "q1_pred": q1_pred.mean(),
        "q2_pred": q2_pred.mean(),
        "q_target": q_target.mean(),
        "actor_q": q_new_actions.mean(),
        "alpha": alpha,
        "loss/q1": qf1_loss,
        "loss/q2": qf2_loss,
        "loss/alpha": alpha_loss,
        "loss/policy": policy_loss,
    },
    step=self._current_epoch,
)
Add a penalty term (the second expectation below) to the Q-function loss of the SAC algorithm:
$$ \mathcal{L}_Q(\theta_i) \triangleq \mathbb{E}_{(s,a,s')\sim \mathcal{D},\, a'\sim \pi_\phi(\cdot \mid s')} \left[ \frac{1}{2} \left( y-Q_{\theta_i}(s,a) \right)^2\right] + \mathbb{E}_{s\sim \mathcal{D},\, a\sim \pi_\phi(\cdot \mid s)} \left[ Q_{\theta_i}(s,a) \right] $$
where the target $y$ is defined as
$$ y \triangleq r(s,a)+\gamma \min_{j=1,2}Q_{\theta_j'}(s',a')-\alpha \log \pi_\phi(a'\mid s') $$
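For reference, the Q loss with the added penalty can be sketched in NumPy as below. This is a minimal illustration, not the code in this PR: the function name `q_loss_with_penalty` and its argument names are hypothetical, the inputs are assumed to be precomputed per-sample arrays, and terminal-state masking is left out because the formula above does not include it. In a real training loop $y$ would also be treated as a constant (no gradient through the target networks).

```python
import numpy as np


def q_loss_with_penalty(q_pred, q_next_min, q_policy, rewards, log_pi_next,
                        alpha, gamma=0.99):
    """Return (loss, y) for one critic Q_{theta_i}.

    q_pred      : Q_{theta_i}(s, a) for dataset pairs (s, a), shape (B,)
    q_next_min  : min_j Q_{theta_j'}(s', a') with a' ~ pi_phi(.|s'), shape (B,)
    q_policy    : Q_{theta_i}(s, a) with a ~ pi_phi(.|s), shape (B,)
    rewards     : r(s, a), shape (B,)
    log_pi_next : log pi_phi(a'|s'), shape (B,)
    """
    q_pred = np.asarray(q_pred, dtype=float)
    q_next_min = np.asarray(q_next_min, dtype=float)
    q_policy = np.asarray(q_policy, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    log_pi_next = np.asarray(log_pi_next, dtype=float)

    # Target y, term by term as in the formula above.
    y = rewards + gamma * q_next_min - alpha * log_pi_next

    # First expectation: half the mean squared TD error.
    td_term = 0.5 * np.mean((y - q_pred) ** 2)

    # Second expectation (the added penalty): mean Q-value of policy actions.
    penalty = np.mean(q_policy)

    return td_term + penalty, y


loss, y = q_loss_with_penalty(
    q_pred=[2.0, 2.0],
    q_next_min=[2.0, 1.0],
    q_policy=[3.0, 1.0],
    rewards=[1.0, 0.0],
    log_pi_next=[-1.0, -2.0],
    alpha=0.5,
    gamma=0.5,
)
print(loss, y)
```

Because the penalty is the mean Q-value on actions sampled from the current policy, minimizing it pushes those Q-values down, while the TD term keeps the values on dataset actions close to $y$.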