Hey, this code and the AWAC paper are awesome! Thanks for sharing this library; I've been reading through it lately while trying to understand and apply the AWAC paper :)
However, I have a question about how the Q(s,a) term of the advantage function is implemented in the library: https://github.com/rail-berkeley/rlkit/blob/c81509d982b4d52a6239e7bfe7d2540e3d3cd986/rlkit/torch/sac/awac_trainer.py#L554 where q1_pred and q2_pred are both computed directly from the learned Q1 and Q2 functions.
I'm asking because, as I understand it, the code uses Q(s,a) directly instead of the 1-step return r(s,a) + Q(s',a') to compute the Q(s,a) term in the advantage function. It seems to me that the 1-step return estimate is already computed in the variable q_target: https://github.com/rail-berkeley/rlkit/blob/c81509d982b4d52a6239e7bfe7d2540e3d3cd986/rlkit/torch/sac/awac_trainer.py#L496
Is there a reason to use Q(s,a) obtained directly as min(Q1(s,a), Q2(s,a)) instead of the 1-step return r(s,a) + min(Q1(s',a'), Q2(s',a'))? Couldn't the direct version introduce more bias into the estimate of the returns?
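To make the comparison concrete, here is a minimal sketch of the two advantage estimates I mean. It is plain Python with made-up numbers, not the rlkit code: the function names `advantage_direct` and `advantage_one_step` are hypothetical, V(s) is passed in as a given baseline, and I'm assuming the usual discount factor `gamma` and terminal mask, which I left out of the formulas above for brevity.

```python
def advantage_direct(q1, q2, v):
    """A(s,a) = min(Q1(s,a), Q2(s,a)) - V(s), using the critics directly."""
    return [min(a, b) - base for a, b, base in zip(q1, q2, v)]


def advantage_one_step(rewards, next_q1, next_q2, dones, v, gamma=0.99):
    """A(s,a) = r + gamma * (1 - done) * min(Q1(s',a'), Q2(s',a')) - V(s).

    gamma and the (1 - done) mask are assumptions of this sketch.
    """
    return [
        r + gamma * (1.0 - d) * min(a, b) - base
        for r, a, b, d, base in zip(rewards, next_q1, next_q2, dones, v)
    ]


# Toy batch of two transitions (made-up numbers, second one is terminal).
q1, q2 = [1.0, 2.0], [1.2, 1.8]            # Q1(s,a), Q2(s,a)
next_q1, next_q2 = [0.5, 3.0], [0.4, 2.5]  # Q1(s',a'), Q2(s',a')
rewards, dones = [0.5, 1.0], [0.0, 1.0]
v = [0.8, 1.5]                             # baseline V(s)

print(advantage_direct(q1, q2, v))
print(advantage_one_step(rewards, next_q1, next_q2, dones, v, gamma=0.9))
```

The two functions differ only in where the Q(s,a) term comes from: the first reads it off the critics at (s,a), the second rebuilds it from the observed reward plus the critics at (s',a').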
Oh, also, I ran some tests with the latter version in my own implementation on a different problem, and the results were very similar, so it's unclear to me whether there's any real benefit to switching from one to the other.
Thanks in advance!