[15 min] Introduction to TD3-related papers

This is an introduction to overestimation bias and TD3.

Overestimation bias, which I looked up out of curiosity

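1. The Optimizer's Curse. This paper shows that when the true values (mu) are unknown and we pick the alternative i with the largest estimated value (V_i), using that maximum as the estimate of the true value is biased upward. The easiest example to follow is in 1.1 Some Prototypical Examples: one sample is drawn from each of three normal distributions (mu=0, std=1), and the maximal value estimate is on average about 0.85 larger than the mean of the true distribution.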
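The three-alternative example can be checked with a small Monte Carlo simulation. This is only a sketch: the function name, the number of trials, and the seed are my own choices, not something taken from the paper.

```python
import random


def expected_max_of_estimates(num_alternatives=3, num_trials=200_000, seed=0):
    """Average of max_i V_i when every V_i ~ N(0, 1), so every true value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        # One noisy estimate per alternative; we always pick the largest one.
        total += max(rng.gauss(0.0, 1.0) for _ in range(num_alternatives))
    return total / num_trials


bias = expected_max_of_estimates()
print(f"average of the max estimate: {bias:.3f} (true value is 0)")
```

Each individual estimate is unbiased, yet the estimate we end up selecting is not, because the selection step prefers whichever estimate happened to receive positive noise.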

Q-learning updates Q(s, a) using max_a Q(s', a), so Q(s', a) plays the role of the estimated values in the paper above, and repeatedly learning from the max value can introduce the same bias.
Another example: https://towardsdatascience.com/double-q-learning-the-easy-way-a924c4085ec3

2. Issues in Using Function Approximation for Reinforcement Learning. This is the first paper to show that overestimation bias can arise when Q-learning is approximated with a function approximator. With a look-up table the values are not distributions, so the optimizer's curse above does not arise, but a function approximator inevitably introduces unknown noise, and this causes overestimation bias. The paper assumes the noise is zero-mean, Q^{approx}(s', \hat a) = Q^{target}(s', \hat a) + Y_{s'}^{\hat a}, and shows that in this case we often have E[Z_s] = E[max_{\hat a} Q^{approx}(s', \hat a) - max_{\hat a} Q^{target}(s', \hat a)] > 0. It then derives how large E[Z_s] can become.
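To see E[Z_s] > 0 numerically, here is a minimal sketch under assumptions of my own: all actions share the same target value, and the noise Y is uniform on [-eps, eps]. Under exactly these assumptions the mean of the maximum of n uniform draws has the closed form eps * (n - 1) / (n + 1), which the simulation is compared against.

```python
import random


def mean_overestimation(num_actions, eps, num_trials=100_000, seed=0):
    """Monte Carlo estimate of E[max_a Q_approx(s', a) - max_a Q_target(s', a)]."""
    rng = random.Random(seed)
    # Assumption: every action has the same target value (the worst case for the max).
    q_target = [0.0] * num_actions
    total = 0.0
    for _ in range(num_trials):
        # Zero-mean approximation noise added independently to each action value.
        q_approx = [q + rng.uniform(-eps, eps) for q in q_target]
        total += max(q_approx) - max(q_target)
    return total / num_trials


for n in (2, 5, 10):
    simulated = mean_overestimation(n, eps=1.0)
    closed_form = 1.0 * (n - 1) / (n + 1)
    print(f"n={n:2d}  simulated={simulated:.3f}  closed form={closed_form:.3f}")
```

The bias grows with the number of actions even though the noise on each individual action value has zero mean.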

TD3

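Addressing Function Approximation Error in Actor-Critic Methods. What worked well in Double DQN runs into trouble in the actor-critic setting: the value estimates do not change much (the current and target estimates stay too similar), so the overestimation bias problem reappears. Applying Double Q-learning to the critic in the same way is reported to reduce the bias. However, this approach induces high variance, so the paper proposes Clipped Double Q-learning.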

[Contribution] (1) Shows that target networks are needed for variance reduction. (2) To deal with the coupling of value and policy, delays policy updates until the value estimate has converged. (3) Introduces a new regularization strategy (a SARSA-style update bootstraps similar action estimates to further reduce variance).

[clipped double q-learning] Section 4 shows that overestimation bias can also arise in actor-critic methods. (I could not follow this part.) Clipped Double Q-learning is applied to fix it. The idea is to bring the target networks used in Double Q-learning over to the actor as well. Thinking it through: pi_1 is optimized with respect to Q_1, and Q_1's target update uses Q_2 as an independent estimate. In practice, however, the critics share a replay memory and cannot be truly independent (Double Q-learning assumes the estimators are unbiased when the subsets of samples used to update them are independent), so Q_2(s, pi_1) > Q_1(s, pi_1) can occur. When that happens, Q_1 gradually overestimates the true value. The paper therefore simply upper-bounds the target by clipping it to the min of Q_1 and Q_2. (Question: if Q_1 is used in the same way, doesn't overestimation bias arise again?)
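As a rough sketch of the min-clipped target for a single transition: the function name, the `done` flag, and the default `gamma` are assumptions of mine for illustration, and `q1_next` / `q2_next` stand for the two target critics evaluated at the next state and the policy's next action.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Target y = r + gamma * min(Q_1', Q_2') for one transition.

    q1_next, q2_next: values of the two target critics at (s', pi(s')).
    done: hypothetical terminal flag; bootstrapping is switched off when True.
    """
    # Taking the min upper-bounds the target by the less optimistic critic.
    min_q = min(q1_next, q2_next)
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * min_q


# Both critics regress onto the same target value y.
y = clipped_double_q_target(reward=1.0, done=False, q1_next=10.0, q2_next=8.0)
print(y)
```

With `q1_next=10.0` and `q2_next=8.0` the target bootstraps from 8.0, so an overestimate in one critic cannot raise the target unless the other critic agrees.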

Updating this way can introduce underestimation bias, but the paper says it will not be propagated, since the policy has no interest in (will not prefer) actions with underestimated values anyway.

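In the actual implementation, the actor is trained only with respect to pi_1, so we only need to think about what happens to Q_2. First, because of the clipped target value, Q_2 is trained on the same target, y_1 = y_2. If Q_2 is larger than Q_1, this is just the normal update; if Q_2 is smaller than Q_1, we can take it as a sign that bias (overestimation) has occurred. (I did not understand this.)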

The paper also says that if the function approximation error is treated as a random variable, the min operator ends up favoring value estimates with lower variance.

[Accumulating Error] Because these are TD updates, each update can leave a small amount of error, and these errors can accumulate.
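Written out, as best I can reconstruct the paper's argument, with \delta(s, a) denoting the residual TD error left after an update (the indexing below is my own reading and should be checked against the paper):

```latex
% Each update satisfies the Bellman equation only up to a residual error \delta:
Q_\theta(s, a) = r + \gamma \, \mathbb{E}\left[ Q_\theta(s', a') \right] - \delta(s, a)

% Unrolling the recursion: the estimate tracks the return minus the discounted future errors.
Q_\theta(s_t, a_t)
  = r_t + \gamma \, \mathbb{E}\left[ Q_\theta(s_{t+1}, a_{t+1}) \right] - \delta_t
  = \mathbb{E}\left[ \sum_{i=t}^{T} \gamma^{\, i-t} \left( r_i - \delta_i \right) \right]
```

So the variance of the value estimate depends on the variance of the future errors as well as the future rewards, which is why small per-update errors matter.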

[target networks and delayed policy updates] This part looks at the relationship between target network updates and function approximation error. Target network updates help training stay stable. Without a target network, residual error accumulates and training can diverge.

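What Fig. 3 is meant to show is that variance is reduced when the policy is updated only after the value has been updated enough for its error to shrink, and this is shown experimentally. In the actual implementation, the actor is updated once after every d critic updates. And then more praise for target networks..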
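The update schedule can be sketched as below. The critic and actor updates are placeholders (only counted, not implemented), and `policy_delay`, `tau`, and both function names are assumptions for illustration. Updating the target networks together with the actor follows my reading of the paper's algorithm.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Slowly move target parameters toward the online ones: t <- tau * p + (1 - tau) * t."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]


def run_schedule(num_steps, policy_delay=2):
    """Count how often each component would be updated under delayed policy updates."""
    critic_updates = 0
    actor_updates = 0
    for step in range(1, num_steps + 1):
        critic_updates += 1  # placeholder: update both critics on every step
        if step % policy_delay == 0:
            # placeholder: update the actor, then soft-update the target networks
            actor_updates += 1
    return critic_updates, actor_updates


print(run_schedule(num_steps=10, policy_delay=2))
print(soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0], tau=0.5))
```

With `policy_delay=2` the critic is updated twice for every actor update, so the policy gradient is always computed against a value estimate that has had extra steps to settle.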

[target policy smoothing regularization] A limitation of deterministic policies is that they can overfit when the critic is estimated, which increases the variance of the target. Regularization can help here. A simple approach is to add a small amount of noise to the target action. The noise is restricted to clip(N(0, sigma), -c, c).
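A minimal sketch of the smoothed target action for a one-dimensional action. The default values of `sigma` and `c`, the final clip to `[-max_action, max_action]`, and the `rng` parameter are assumptions I added for illustration.

```python
import random


def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(target_action, sigma=0.2, c=0.5, max_action=1.0, rng=random):
    """Return target_action + clip(N(0, sigma), -c, c), kept inside the action range."""
    # Clipping the noise keeps the perturbed action close to the original one.
    noise = clip(rng.gauss(0.0, sigma), -c, c)
    # Assumption: actions live in [-max_action, max_action].
    return clip(target_action + noise, -max_action, max_action)


rng = random.Random(0)
samples = [smoothed_target_action(0.3, rng=rng) for _ in range(5)]
print(samples)
```

The critic target is then computed at this perturbed action, so similar actions are pushed toward similar values instead of the critic fitting a sharp peak at a single action.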

[ablation study] The ablation study is run separately for Clipped Double Q-learning, Delayed Policy Updates, and Target Policy Smoothing, and the results differ from task to task.
