Closed KK666-AI closed 4 years ago
Not sure I understand your question. The loss function itself doesn't determine whether an algorithm is on- or off-policy; the source of the data does. We sample from a replay buffer, which makes our algorithm off-policy. The advantage function is just a way to reduce variance in the policy gradient and is known to leave the estimate unbiased. In fact, I believe I based this approach on an older version of the SAC paper, which used a score-function estimator with an advantage function rather than the reparameterization trick that the newest version of the paper uses.
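For contrast, the reparameterization trick mentioned above can be sketched in a few lines for a Gaussian policy. All tensors here are hypothetical placeholders, not the repo's variables:

```python
import torch

torch.manual_seed(0)
# Hypothetical network outputs for a Gaussian policy over 2 action dims.
mean = torch.zeros(32, 2, requires_grad=True)
log_std = torch.zeros(32, 2, requires_grad=True)

# Reparameterization trick: a = mu + sigma * eps, with eps drawn
# independently of the parameters, so the sample is differentiable.
eps = torch.randn(32, 2)
action = mean + eps * log_std.exp()

# Gradients flow through `action` into mean/log_std directly,
# instead of relying on a score-function (REINFORCE-style) estimator.
action.sum().backward()
```

The score-function estimator discussed in this thread instead multiplies `log_pi` by a detached target, which needs no differentiable sampling path.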
OK. I also read the old version of SAC. I think the old version is exactly advantage actor-critic (A2C) plus entropy regularization. The advantage function itself has nothing to do with the bias, but the policy does. When you use A2C to directly update the policy, as in REINFORCE, it's an on-policy method. That is, you should only use on-policy samples rather than older samples, which were generated by an older policy rather than the current one. If you do use the older samples, importance sampling should be applied to correct the bias.
I think I see where the confusion is. Importance sampling is used to correct for using actions from old policies. We are in fact (as in all off-policy actor-critic methods) re-sampling the actions for the update from the current policy. As such, importance sampling is not needed.
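The re-sampling idea can be sketched as follows, assuming a simple discrete policy; names and shapes are illustrative, not the repo's API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs = torch.randn(32, 8)                 # states sampled from the replay buffer
policy = nn.Sequential(nn.Linear(8, 4))  # hypothetical logits over 4 actions

# Re-sample actions from the *current* policy at the buffered states,
# ignoring whatever actions were stored alongside those states.
logits = policy(obs)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
log_pi = dist.log_prob(actions)

# Because these actions come from the current policy (not the buffer),
# no importance weights over the action distribution are needed.
```

The stored actions are still used for the critic (Q) update; only the actor update re-samples.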
I took a look at your code again. Exactly, you re-sample the action using:
curr_ac, probs, log_pi, pol_regs, ent = pi(ob, return_all_probs=True, return_log_pi=True, regularize=True, return_entropy=True)
Yes, that's correct. There is no need for importance sampling there. Great.
I'll close this issue.
I have another concern. Although the log_pi values are generated by the current policy, the training data are randomly sampled from the reservoir (that is, the replay buffer). However, the data in the reservoir were generated by old policies. So it's still a biased estimate, right?
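For reference, the importance-sampling correction discussed earlier in this thread could be sketched as below, assuming the behavior policy's log-probabilities were stored with each transition (hypothetical tensors). Note that this reweights only the action distribution; it does not correct the mismatch in the state distribution raised here, which off-policy actor-critic methods typically ignore in practice:

```python
import torch

torch.manual_seed(0)
# Hypothetical per-sample quantities for a batch of 32 transitions.
log_pi_current = torch.randn(32)   # log pi_theta(a|s) under the current policy
log_pi_behavior = torch.randn(32)  # log mu(a|s) stored at collection time
advantages = torch.randn(32)

# Importance ratio pi_theta(a|s) / mu(a|s), clipped for stability.
rho = torch.exp(log_pi_current - log_pi_behavior).clamp(max=10.0)

# Reweighted score-function term for the policy gradient.
weighted_pg = (rho * log_pi_current * advantages).mean()
```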
Thanks. I will explore this problem later.
Dear Author,
I took a quick look at your code for the actor updates. It seems that you have used an advantage soft actor-critic, i.e., advantage:
pol_target = q - v
loss: pol_loss = (log_pi * (log_pi / self.reward_scale - pol_target).detach()).mean()
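A runnable sketch of the update quoted above, with hypothetical tensors standing in for the repo's variables (q, v, log_pi, self.reward_scale):

```python
import torch

torch.manual_seed(0)
# Hypothetical batch of 32 samples.
log_pi = torch.randn(32, requires_grad=True)  # log pi(a|s) of re-sampled actions
q = torch.randn(32)                           # soft Q-value of those actions
v = torch.randn(32)                           # soft state value (baseline)
reward_scale = 10.0                           # placeholder for self.reward_scale

pol_target = q - v                            # the (soft) advantage

# Score-function loss: the target is detached, so gradients flow only
# through the leading log_pi factor, matching the quoted update.
pol_loss = (log_pi * (log_pi / reward_scale - pol_target).detach()).mean()
pol_loss.backward()
```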
If you use the above update, I think it's an on-policy soft A2C; therefore, to remain unbiased, the actor should only be updated with the incremental (on-policy) data rather than data from the replay buffer. Otherwise, it will be a biased estimate of the real policy gradient. Right?
Best, Hui