Closed KK666-AI closed 4 years ago
Not sure I understand your question. The loss function itself doesn't determine whether an algorithm is on- or off-policy; the source of the data does. We sample from a replay buffer, which makes our algorithm off-policy. The advantage function is just a way to reduce variance in the policy gradient and is known to leave the estimate unbiased. In fact, I believe I based this approach on an older version of the SAC paper, which used a score-function estimator with an advantage function rather than the reparameterization trick that the newest version of the paper uses.
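For contrast, the reparameterization trick mentioned above can be sketched in a few lines for a Gaussian policy. All tensors here are hypothetical placeholders, not the repo's variables:

```python
import torch

torch.manual_seed(0)
# Hypothetical network outputs for a Gaussian policy over 2 action dims.
mean = torch.zeros(32, 2, requires_grad=True)
log_std = torch.zeros(32, 2, requires_grad=True)

# Reparameterization trick: a = mu + sigma * eps, with eps drawn
# independently of the parameters, so the sample is differentiable.
eps = torch.randn(32, 2)
action = mean + eps * log_std.exp()

# Gradients flow through `action` into mean/log_std directly,
# instead of relying on a score-function (REINFORCE-style) estimator.
action.sum().backward()
```

The score-function estimator discussed in this thread instead multiplies `log_pi` by a detached target, which needs no differentiable sampling path.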
OK. I also read the old version of SAC. I think the old version is exactly advantage actor-critic (A2C) plus entropy regularization. The advantage function itself has nothing to do with the bias, but the policy does. When you use A2C to directly update the policy, as in REINFORCE, it's an on-policy method. That is, you should only use on-policy samples rather than older samples, which were generated by an older policy rather than the current one. If you do use the older samples, importance sampling should be applied to correct the bias.
I think I see where the confusion is. Importance sampling is used to correct for using actions from old policies. We are in fact (as in all off-policy actor-critic methods) re-sampling the actions for the update from the current policy. As such, importance sampling is not needed.
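The re-sampling idea can be sketched as follows, assuming a simple discrete policy; names and shapes are illustrative, not the repo's API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs = torch.randn(32, 8)                 # states sampled from the replay buffer
policy = nn.Sequential(nn.Linear(8, 4))  # hypothetical logits over 4 actions

# Re-sample actions from the *current* policy at the buffered states,
# ignoring whatever actions were stored alongside those states.
logits = policy(obs)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
log_pi = dist.log_prob(actions)

# Because these actions come from the current policy (not the buffer),
# no importance weights over the action distribution are needed.
```

The stored actions are still used for the critic (Q) update; only the actor update re-samples.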
I took a look at your code again. Exactly, you re-sample the action using:
curr_ac, probs, log_pi, pol_regs, ent = pi(ob, return_all_probs=True, return_log_pi=True, regularize=True, return_entropy=True)
Yes, that's correct. There is no need for importance sampling there. Great.
I'll close this issue.
I have another concern. Although the log_pi values are generated by the current policy, the training data are randomly sampled from the reservoir (that is, the replay buffer). However, the data in the reservoir were generated by old policies. So it's still a biased estimate, right?
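For reference, the importance-sampling correction discussed earlier in this thread could be sketched as below, assuming the behavior policy's log-probabilities were stored with each transition (hypothetical tensors). Note that this reweights only the action distribution; it does not correct the mismatch in the state distribution raised here, which off-policy actor-critic methods typically ignore in practice:

```python
import torch

torch.manual_seed(0)
# Hypothetical per-sample quantities for a batch of 32 transitions.
log_pi_current = torch.randn(32)   # log pi_theta(a|s) under the current policy
log_pi_behavior = torch.randn(32)  # log mu(a|s) stored at collection time
advantages = torch.randn(32)

# Importance ratio pi_theta(a|s) / mu(a|s), clipped for stability.
rho = torch.exp(log_pi_current - log_pi_behavior).clamp(max=10.0)

# Reweighted score-function term for the policy gradient.
weighted_pg = (rho * log_pi_current * advantages).mean()
```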
Thanks. I will explore this problem later.
Dear Author,
I took a quick look at your code for the actor updates. It seems that you have used an advantage soft actor-critic, i.e., advantage:
pol_target = q - v
loss: pol_loss = (log_pi * (log_pi / self.reward_scale - pol_target).detach()).mean()
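A runnable sketch of the update quoted above, with hypothetical tensors standing in for the repo's variables (q, v, log_pi, self.reward_scale):

```python
import torch

torch.manual_seed(0)
# Hypothetical batch of 32 samples.
log_pi = torch.randn(32, requires_grad=True)  # log pi(a|s) of re-sampled actions
q = torch.randn(32)                           # soft Q-value of those actions
v = torch.randn(32)                           # soft state value (baseline)
reward_scale = 10.0                           # placeholder for self.reward_scale

pol_target = q - v                            # the (soft) advantage

# Score-function loss: the target is detached, so gradients flow only
# through the leading log_pi factor, matching the quoted update.
pol_loss = (log_pi * (log_pi / reward_scale - pol_target).detach()).mean()
pol_loss.backward()
```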
If you use the above update, I think it's an on-policy soft A2C; therefore, to remain unbiased, the actor should only be updated with the incremental (on-policy) data rather than data from the replay buffer. Otherwise, it will be a biased estimate of the real policy gradient. Right?
Best, Hui