tedhuang96 opened 1 year ago
Hey, good question. As I recall, the policy-over-options is updated off-policy, whereas the intra-option policies are updated on-policy. We only use the replay buffer to update the policy-over-options; I believe the paper mentions that the target $g_t$ is indeed off-policy (text box above Algorithm 1). You can see this in the codebase:
Does that answer your question?
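For concreteness, here is a minimal NumPy sketch (made-up function and argument names, not the repo's actual code) of the one-step off-policy target $g_t^{(1)}$ computed from replay-buffer samples for the policy-over-options critic:

```python
import numpy as np

def option_value_target(r, done, q_next, beta_next, w, gamma=0.99):
    """One-step target g_t^(1) for the policy-over-options critic.

    q_next:    Q_Omega(s', .) over all options, shape (n_options,)
    beta_next: termination probability beta_w(s') of the current option w
    """
    # If the option continues, bootstrap from Q_Omega(s', w);
    # if it terminates, bootstrap from the greedy value max_w' Q_Omega(s', w').
    u_next = (1.0 - beta_next) * q_next[w] + beta_next * q_next.max()
    return r + gamma * (1.0 - done) * u_next
```

Because the target bootstraps from the current $Q_\Omega$ rather than from the behavior that generated the transition, it can be computed from stale replay-buffer samples, which is why only this update uses the buffer.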
Thank you for your prompt reply! That clarifies my confusion about the use of the replay buffer.
I have another question that popped up while reading the code. In the critic loss computation, I see you compute the TD error between $g_t^{(1)}$ and $Q_{\Omega}(s, w)$ (using the paper's notation). I checked the authors' code and found they do the same thing.
But from Equations (2) and (3) in the paper, and the Q-function update in Algorithm 1, I understand that we should compute the TD error between $Q_U(s,w,a)$ and $g_t^{(1)}$, and update $Q_U(s,w,a)$, not $Q_{\Omega}(s, w)$.
If the underlying assumption is that $Q_{\Omega}(s,w)$ is equivalent to $Q_U(s,w,a)$ under the sampling process via Equation (1) in the paper, then the sampling has to be done on-policy, not off-policy, because $\pi_{w,\theta}(a|s)$ in Equation (1) would differ between the target policy and the behavior policy. I would appreciate it if you could share some thoughts on this.
https://github.com/jeanharb/option_critic/blob/master/neural_net.py#L106-L117
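To make the discrepancy concrete, here is a toy tabular sketch (hypothetical names and shapes, purely for illustration) contrasting the update Algorithm 1 prescribes with what both codebases actually do:

```python
import numpy as np

# Hypothetical tabular values: 2 states, 2 options, 2 actions.
q_omega = np.zeros((2, 2))     # Q_Omega(s, w)
q_u = np.zeros((2, 2, 2))      # Q_U(s, w, a)

def algorithm1_update(s, w, a, g, lr=0.1):
    # Algorithm 1 / Eqs. (2)-(3): TD error against Q_U(s, w, a).
    q_u[s, w, a] += lr * (g - q_u[s, w, a])

def codebase_update(s, w, g, lr=0.1):
    # What both implementations do: TD error against Q_Omega(s, w),
    # using the same target g_t^(1).
    q_omega[s, w] += lr * (g - q_omega[s, w])
```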
Right, so the authors mention that learning both $Q_\Omega$ and $Q_U$ is computationally wasteful, so they decide to learn only $Q_\Omega$ and to derive an estimate of $Q_U$ from it. In practice we completely omit learning $Q_U$ for high-dimensional state spaces.
For the intra-option policies this makes no difference, since adding a baseline is strictly variance-reducing and does not bias the policy gradient. For the policy-over-options I'm fairly sure that in expectation the way we learn $Q_\Omega$ remains the same, but it has been a while since I looked at the math. You might find an answer in Pierre-Luc Bacon's PhD thesis (https://pierrelucbacon.com/bacon2018thesis.pdf), Section 5, which covers Option-Critic.
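For reference, the relation the two quantities rest on is Equation (1), $Q_\Omega(s, w) = \sum_a \pi_{w,\theta}(a|s)\, Q_U(s, w, a)$. A minimal sketch (hypothetical names, not from either repo):

```python
import numpy as np

def q_omega_from_q_u(pi_w, q_u_sw):
    # Eq. (1): Q_Omega(s, w) = sum_a pi_{w,theta}(a|s) * Q_U(s, w, a).
    # pi_w:    intra-option policy probabilities pi_{w,theta}(.|s)
    # q_u_sw:  Q_U(s, w, .) over all actions
    return float(np.dot(pi_w, q_u_sw))
```

This identity is an expectation under the *current* intra-option policy, which is the crux of the on-policy concern above: when actions in the buffer come from an older policy, the single-sample substitution of $Q_\Omega$ for $Q_U$ no longer matches this expectation exactly.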
If you do find an answer to this, please let me know! 😀
I appreciate the information! I'll dig in and look for an answer!
Thanks for providing the PyTorch version of Option-Critic. I want to ask why we don't clear the replay buffer after each episode for the on-policy policy-gradient update. Both Algorithm 1 in the paper and the derivation of the intra-option policy gradient theorem assume an on-policy setup. If we do not clear the replay buffer, importance sampling should be implemented to account for the off-policy updates, but I did not see any code related to that. I tried reading the original Theano repo, and it seems they do the same thing. Do you have any comments on this?
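The correction the question refers to would look roughly like the following sketch (hypothetical names; neither repo implements this): scaling each policy-gradient term by the importance ratio between the current policy and the behavior policy that generated the stored transition.

```python
import numpy as np

def is_weighted_pg_term(logp_target, logp_behavior, advantage):
    # Importance-sampling correction for an off-policy policy-gradient term:
    # rho = pi_target(a|s) / pi_behavior(a|s), computed in log-space for stability.
    rho = np.exp(logp_target - logp_behavior)
    return rho * advantage
```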