tedhuang96 opened 1 year ago
Hey, good question. As I recall, the policy-over-options is updated off-policy, whereas the intra-option policies are updated on-policy. We only use the replay buffer to update the policy-over-options; I believe the paper mentions that the target $g_t$ is indeed off-policy (text box above Algorithm 1). You can see this in the codebase:
Does that answer your question?
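For concreteness, here is a minimal NumPy sketch (made-up function and argument names, not the repo's actual code) of the one-step off-policy target $g_t^{(1)}$ computed from replay-buffer samples for the policy-over-options critic:

```python
import numpy as np

def option_value_target(r, done, q_next, beta_next, w, gamma=0.99):
    """One-step target g_t^(1) for the policy-over-options critic.

    q_next:    Q_Omega(s', .) over all options, shape (n_options,)
    beta_next: termination probability beta_w(s') of the current option w
    """
    # If the option continues, bootstrap from Q_Omega(s', w);
    # if it terminates, bootstrap from the greedy value max_w' Q_Omega(s', w').
    u_next = (1.0 - beta_next) * q_next[w] + beta_next * q_next.max()
    return r + gamma * (1.0 - done) * u_next
```

Because the target bootstraps from the current $Q_\Omega$ rather than from the behavior that generated the transition, it can be computed from stale replay-buffer samples, which is why only this update uses the buffer.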
Thank you for your prompt reply! That clarifies my confusion about the use of the replay buffer.
I have another question that popped up while reading the code. In the critic loss computation, I see you compute the TD error between $g_t^{(1)}$ and $Q_{\Omega}(s, w)$ (using the paper's notation). I checked the authors' code and found they do the same thing.
But from Equations (2) and (3) in the paper, and the Q-function update in Algorithm 1, I understand that we should compute the TD error between $Q_U(s,w,a)$ and $g_t^{(1)}$, and update $Q_U(s,w,a)$, not $Q_{\Omega}(s, w)$.
If the underlying assumption is that $Q_{\Omega}(s,w)$ is equivalent to $Q_U(s,w,a)$ under the sampling process via Equation (1) in the paper, then the sampling has to be done on-policy, not off-policy, because $\pi_{w,\theta}(a|s)$ in Equation (1) would differ between the target policy and the behavior policy. I would appreciate it if you could share some thoughts on this.
https://github.com/jeanharb/option_critic/blob/master/neural_net.py#L106-L117
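To make the discrepancy concrete, here is a toy tabular sketch (hypothetical names and shapes, purely for illustration) contrasting the update Algorithm 1 prescribes with what both codebases actually do:

```python
import numpy as np

# Hypothetical tabular values: 2 states, 2 options, 2 actions.
q_omega = np.zeros((2, 2))     # Q_Omega(s, w)
q_u = np.zeros((2, 2, 2))      # Q_U(s, w, a)

def algorithm1_update(s, w, a, g, lr=0.1):
    # Algorithm 1 / Eqs. (2)-(3): TD error against Q_U(s, w, a).
    q_u[s, w, a] += lr * (g - q_u[s, w, a])

def codebase_update(s, w, g, lr=0.1):
    # What both implementations do: TD error against Q_Omega(s, w),
    # using the same target g_t^(1).
    q_omega[s, w] += lr * (g - q_omega[s, w])
```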
Right, so the authors mention that learning both $Q_\Omega$ and $Q_U$ is computationally wasteful, so they decide to learn only $Q_\Omega$ and to derive an estimate of $Q_U$ from it. In practice we completely omit learning $Q_U$ for high-dimensional state spaces.
For the intra-option policies this makes no difference, since adding a baseline is strictly variance-reducing and does not bias the policy gradient. For the policy-over-options I'm fairly sure that in expectation the way we learn $Q_\Omega$ remains the same, but it has been a while since I looked at the math. You might find an answer in Pierre-Luc Bacon's PhD thesis (https://pierrelucbacon.com/bacon2018thesis.pdf), Section 5, which covers Option-Critic.
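For reference, the relation the two quantities rest on is Equation (1), $Q_\Omega(s, w) = \sum_a \pi_{w,\theta}(a|s)\, Q_U(s, w, a)$. A minimal sketch (hypothetical names, not from either repo):

```python
import numpy as np

def q_omega_from_q_u(pi_w, q_u_sw):
    # Eq. (1): Q_Omega(s, w) = sum_a pi_{w,theta}(a|s) * Q_U(s, w, a).
    # pi_w:    intra-option policy probabilities pi_{w,theta}(.|s)
    # q_u_sw:  Q_U(s, w, .) over all actions
    return float(np.dot(pi_w, q_u_sw))
```

This identity is an expectation under the *current* intra-option policy, which is the crux of the on-policy concern above: when actions in the buffer come from an older policy, the single-sample substitution of $Q_\Omega$ for $Q_U$ no longer matches this expectation exactly.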
If you do find an answer to this, please let me know! 😀
I appreciate the information! I'll dig in and look for an answer!
Thanks for providing the PyTorch version of Option-Critic. I want to ask why we don't clear the replay buffer after each episode for the on-policy policy-gradient update. Both Algorithm 1 in the paper and the derivation of the intra-option policy gradient theorem assume an on-policy setup. If we do not clear the replay buffer, importance sampling should be implemented to account for the off-policy updates, but I did not see any code related to that. I tried reading the original Theano repo, and it seems they do the same thing. Do you have any comments on this?
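The correction the question refers to would look roughly like the following sketch (hypothetical names; neither repo implements this): scaling each policy-gradient term by the importance ratio between the current policy and the behavior policy that generated the stored transition.

```python
import numpy as np

def is_weighted_pg_term(logp_target, logp_behavior, advantage):
    # Importance-sampling correction for an off-policy policy-gradient term:
    # rho = pi_target(a|s) / pi_behavior(a|s), computed in log-space for stability.
    rho = np.exp(logp_target - logp_behavior)
    return rho * advantage
```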