rail-berkeley / rlkit

Collection of reinforcement learning algorithms
MIT License

Question about the advantage calculation used to update the policy in AWAC. #160

Open Roberto09 opened 2 years ago

Roberto09 commented 2 years ago

Hey, this code and the AWAC paper are awesome! Thanks for sharing this library; I've been reading through it lately, trying to understand and apply the AWAC paper. :)

However, I have a question about how the Q(s,a) term of the advantage function is implemented in the library: https://github.com/rail-berkeley/rlkit/blob/c81509d982b4d52a6239e7bfe7d2540e3d3cd986/rlkit/torch/sac/awac_trainer.py#L554, where q1_pred and q2_pred are both computed directly from the learned Q1 and Q2 functions.
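
For context, my reading of that part of the trainer is roughly the following. This is a simplified sketch rather than the exact rlkit code; the names (qf1, qf2, policy, beta) just follow the paper's notation, and all tensors are assumed to have shape [batch]:

```python
import torch
import torch.nn.functional as F

# Simplified sketch of an AWAC-style advantage-weighted policy update
# (not the exact rlkit code). qf1/qf2 are the learned Q-networks, `policy`
# is assumed to return a torch.distributions object, and `beta` is the
# temperature from the AWAC paper.
def awac_policy_loss(qf1, qf2, policy, obs, actions, beta=1.0):
    # Q(s, a) for the dataset actions, using the clipped double-Q minimum.
    q_pred = torch.min(qf1(obs, actions), qf2(obs, actions))

    # V(s) is estimated by evaluating Q at actions sampled from the current policy.
    dist = policy(obs)
    new_actions = dist.rsample()
    v_pi = torch.min(qf1(obs, new_actions), qf2(obs, new_actions))

    # Advantage A(s, a) = Q(s, a) - V(s); its exponential weights the
    # log-likelihood of the dataset actions under the current policy.
    advantage = (q_pred - v_pi).detach()
    weights = F.softmax(advantage / beta, dim=0)

    log_pi = dist.log_prob(actions)
    return -(log_pi * weights).sum()
```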

I was wondering about this because, as I understand it, the code uses Q(s,a) directly instead of the 1-step return r(s,a) + γ·Q(s',a') to compute the Q(s,a) term of the advantage function. It seems that the 1-step return estimate is already computed in the variable q_target: https://github.com/rail-berkeley/rlkit/blob/c81509d982b4d52a6239e7bfe7d2540e3d3cd986/rlkit/torch/sac/awac_trainer.py#L496
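
In other words, something along these lines already exists for the critic update (again a sketch under the usual SAC/AWAC conventions; target_qf1/target_qf2 and discount are my names for the target networks and γ, not necessarily the ones in the code):

```python
import torch

# Sketch of the usual 1-step Bellman target (roughly what `q_target` holds).
def one_step_q_target(target_qf1, target_qf2, policy, rewards, next_obs,
                      terminals, discount=0.99):
    with torch.no_grad():
        next_dist = policy(next_obs)          # assumed to return a distribution
        next_actions = next_dist.sample()
        target_q = torch.min(target_qf1(next_obs, next_actions),
                             target_qf2(next_obs, next_actions))
        # 1-step return: r + gamma * (1 - done) * min_i Q_target_i(s', a')
        return rewards + (1.0 - terminals) * discount * target_q
```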

Is there a reason to use Q(s,a) obtained directly as min(Q1(s,a), Q2(s,a)) instead of the 1-step return r(s,a) + γ·min(Q1(s',a'), Q2(s',a'))? Couldn't that introduce more bias into the return estimate?
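
Concretely, the swap I'm asking about would look something like this (a hypothetical variant, reusing the names from the sketches above):

```python
import torch.nn.functional as F

# Hypothetical variant: plug the 1-step return in as the Q(s, a) term of the
# advantage. `q_target` would come from something like one_step_q_target()
# above, and `v_pi` / `log_pi` from the first sketch; `beta` is the same
# temperature as before.
def awac_policy_loss_with_bootstrap(q_target, v_pi, log_pi, beta=1.0):
    advantage = (q_target - v_pi).detach()     # replaces min(q1_pred, q2_pred) - v_pi
    weights = F.softmax(advantage / beta, dim=0)
    return -(log_pi * weights).sum()
```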

Oh, also, I ran some tests switching to the latter formulation in my own implementation on a different problem, and the results were very similar, so it's unclear to me whether there's actually any benefit to switching from one formulation to the other.

Thanks in advance!