Closed dvstter closed 3 years ago
I'd love nothing more than to have an implementation of ARMAC in OpenSpiel. It was also requested in the past, see here: https://github.com/deepmind/open_spiel/issues/441.
I was hoping to reimplement it as a side project to learn JAX, but I've been saying that for 6 months. I don't think I will realistically be able to get to it because I've got too many other things to do that are higher priority, but I would be happy to help by looking over your code if you implemented it and submitted it as a PR.
Thanks for your interest!
Thanks for your reply, I will try to implement it by myself.
@lanctot is there still interest in a JAX implementation of ARMAC for OpenSpiel?
@wbrenton
Absolutely! But it's not an easy algorithm to implement, so beware. Actually, if you know your way around JAX I can see it not being too bad.
@dvstter submitted a PyTorch PR: https://github.com/deepmind/open_spiel/pull/706. Unfortunately Audrunas never found the time to look through it, so we never imported it.
I would be happy to re-open the idea of importing ARMAC into OpenSpiel. However, if we do, it'd be good to see exploitability graphs comparable to the paper (or at least going down over time) on the benchmark games to be somewhat convinced that it's correct.
Fantastic! I'm pretty comfortable in JAX but I'll certainly reach out for some help to make sure the math is sound.
Looks like there is some good info in https://github.com/deepmind/open_spiel/issues/441, glad to have that. I was curious if the seminar mentioned in https://github.com/deepmind/open_spiel/issues/441#issuecomment-884574977 ever made it to your YouTube or elsewhere. Or if you or any of the other authors have ever given talks on ARMAC since.
Did you ever peek at @dvstter's implementation? Just want to get a feel for whether it is the right direction or not. I need to look more into implementing the TreeBackup aspect but am interested in doing so in a future iteration.
Would using https://github.com/deepmind/open_spiel/blob/b7ad42fbb3815a469c802eb563b6ca59ec5cd796/open_spiel/python/algorithms/exploitability.py#L154 plotted with epochs on the x-axis be what you're looking for?
Thanks in advance!
Yes, the seminar is here: https://www.youtube.com/watch?v=ycQqCT7FFRw. I spent less than 5 min on ARMAC starting at 38:50. So it's an overview but it is a good place to start.
I did not look deeply at @dvstter's implementation because I thought Audrunas would and then I got super busy, but if there's renewed interest, I would be happy to help in both cases, as it would be really nice to finally have ARMAC in the code base. As I told them, I think you can start with DQN for the off-policy policy evaluation rather than TreeBackup.
Yes, NashConv plotted on a graph with the x-axis as the iterations is what I had in mind (like the graphs in the paper).
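For reference, a minimal sketch of that evaluation loop using the exploitability module; the TabularPolicy below is just a uniform placeholder standing in for whatever average-policy wrapper your implementation exposes, and the iteration count is arbitrary:

```python
import pyspiel
from open_spiel.python import policy
from open_spiel.python.algorithms import exploitability

game = pyspiel.load_game("leduc_poker")
avg_policy = policy.TabularPolicy(game)  # placeholder for the learned average policy

nash_conv_history = []
for iteration in range(100):
    # ... one ARMAC training iteration would go here, refreshing avg_policy ...
    nash_conv_history.append(exploitability.nash_conv(game, avg_policy))
# Plot nash_conv_history against the iteration index to get the curve.
```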
@lanctot could you clarify something for me?
Do we use \vec{r} (that is cached where player i acts) as our \bar{W} target?
In the text to the left of the pseudocode it says that \bar{W} estimates A_\mu, whereas in the pseudocode the subscript \mu is missing (train \bar{W}_{\theta^t} to predict A(h,a)).
Based on https://github.com/deepmind/open_spiel/issues/441#issuecomment-918476921 I'm guessing we do use \vec{r} as the target.
At the same time, I'm having trouble reconciling \vec{r} being calculated with \theta^j when (as I mentioned above) the text to the left of the pseudocode describes \bar{W} as estimating the expected advantage calculated with the behavior params \mu.
edit: I'll add... if we do use \vec{r}, what is the purpose of caching the behavior policy?
Hi @wbrenton,
Yes, indeed \vec{r} is the target for \bar{W}.
The subscript \mu is missing because in this case q_{\theta^j} is used to denote the function-approximated analogue (i.e. this is the critic that is learned off-policy via TreeBackup) at a specific epoch j. q_\mu(h,a) or v_\mu(h) are never computed exactly because it's intractable. So we estimate q_\mu(h,a) - v_\mu(h) using neural networks (with parameters \theta). Note that \vec{r} is actually a function of the epoch that you sampled on this trajectory (j), so it uses the parameters for the networks (critics) at that specific epoch.
j is sampled uniformly. So think of \vec{r} as a sampled regret from the uniform distribution over CFR iterates -- at least that's true for opponent policies. Since we sample on-policy at opponent nodes, the expected regret will be proportional to the counterfactual reach probability. In particular, since \mu^t is fixed, the denominator will be the same for all actions, which makes \bar{W} different from \bar{R} but only by a scaling "constant" (a quantity that is a function of s only). See Equation 3 and surrounding text.
We don't cache the behavior policy \mu^t, indeed that would be pointless. We cache the actual regret-matching policy \pi^j(s). The reason is, we use it as a training target for the average policy \bar{\pi} at opponent nodes (last line of the inner loop on the learning steps further down in the algorithm)
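To make the regret-matching step concrete, here is a rough sketch (my own illustration, not the paper's code) of how a cached policy \pi^j(s) comes out of the predicted regrets \bar{W}(s, a):

```python
import numpy as np

def regret_matching_policy(predicted_regrets):
    """Rough sketch: turn the W-bar(s, .) predictions into a policy via regret matching.

    predicted_regrets: array of shape [num_actions] holding the network's
    regret/advantage estimates for the legal actions at s.
    """
    positive = np.maximum(predicted_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # If no action has positive regret, fall back to the uniform policy.
    return np.full(len(positive), 1.0 / len(positive))
```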
These are great questions... please let me know if anything was unclear!
Very helpful thank you!
Could you help me understand what the bolded "a batch of 64 trajectories of length 32" looks like for Leduc poker? The longest action sequence, to my understanding, is (P1: C, P2: R, P1: R, P2: C or F). If you're concatenating episodes to get length 32, was there any masking done in between episodes?
I'm writing this initial version for Leduc, then I'll iterate and expand the implementation to all applicable games.
I believe the next two sentences answer how Audrunas did it:
In order to implement a full recall, all unfinished episodes will be continued on the next training iteration by propagating recurrent network states forward. Each time when one episode finishes at a particular batch entry, a new one is sampled and started to be unrolled from the beginning
So I don't think it even did any resetting of LSTM state between episodes, and the only thing different about the end of an episode is that there's a reward contained there. In most RL environments, the gamma is set to 0 when you hit the end of an episode to signify that the algorithm doesn't bootstrap across episode boundaries; I'm not sure if he did that, but I'll check.
Just asked Audrunas: turns out he did both reset the LSTM at episode boundaries and set gamma = 0 for those terminal states (to prevent bootstrapping returns into the next episode).
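In code terms that just means zeroing the discount at terminal steps and resetting the recurrent state per batch entry, something like this rough sketch (made-up array names, not Audrunas' code):

```python
import numpy as np

def episode_boundary_update(is_terminal, core_state, initial_core_state, gamma=1.0):
    """Sketch for one step of the batched (e.g. 64 x 32) unroll.

    is_terminal:        bool array of shape [batch], True where an episode just ended.
    core_state:         recurrent (LSTM) state, shape [batch, state_dim].
    initial_core_state: fresh state to restart from, same shape as core_state.
    """
    # A discount of 0 at terminals prevents bootstrapping across episode boundaries.
    discounts = np.where(is_terminal, 0.0, gamma)
    # Reset the recurrent state for the batch entries that start a new episode.
    core_state = np.where(is_terminal[:, None], initial_core_state, core_state)
    return discounts, core_state
```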
I might suggest using a feed-forward net at first while checking correctness.
BTW one other thing, when training the "history-based critics", Audrunas concatenated the information state tensors (from both players' perspectives) as an input to represent the history.
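Concretely, with the OpenSpiel Python API that input can be built along these lines (a sketch for the 2-player case, using Leduc just as an example):

```python
import numpy as np
import pyspiel

game = pyspiel.load_game("leduc_poker")
state = game.new_initial_state()
# Resolve the dealing (chance nodes) so we are at a player decision node.
while state.is_chance_node():
    action, _ = state.chance_outcomes()[0]
    state.apply_action(action)

# History-critic input: concatenate both players' information state tensors
# for the same underlying history h.
history_input = np.concatenate([
    np.asarray(state.information_state_tensor(0)),
    np.asarray(state.information_state_tensor(1)),
])
```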
Ok, thank you for that info. Question regarding that last part: what are s_0 and s_1 in paragraph 4 of Appendix F then? I had interpreted them as each player's infostate. If that's correct, why concatenate them and then immediately partition again? Or perhaps h1 is the concatenation operation you're referring to? Am I missing something here?
Also, the other thread on ARMAC had mentioned using a traverse-the-game-tree method to recursively generate episodes (similar to the Deep CFR implementation). Being that this is an RL algorithm, I'm using the RL API, similar to the NFSP implementation. Just thought I'd double check on that.
Ok, thank you for that info. Question regarding that last part: what are s_0 and s_1 in paragraph 4 of Appendix F then? I had interpreted them as each player's infostate. If that's correct, why concatenate them and then immediately partition again? Or perhaps h1 is the concatenation operation you're referring to? Am I missing something here?
That's correct. s_0 and s_1 are the info states from the perspectives of the two different players, for the same history h. (These have been called "augmented information states" in the past.) In OpenSpiel, you can get them via State::InformationStateTensor(Player player).
I do not understand what you mean when you say "immediately partition them again"; can you elaborate? Maybe you mean why q_0 and q_1 are different outputs? That's because they represent different values (one to player 0 and the other to player 1). Am I answering the right question?
Also, the other thread on ARMAC had mentioned using a traverse-the-game-tree method to recursively generate episodes (similar to the Deep CFR implementation). Being that this is an RL algorithm, I'm using the RL API, similar to the NFSP implementation. Just thought I'd double check on that.
That seems weird to me...? We quite strongly motivated ARMAC as a model-free method which distinguishes it from the standard Deep CFR (using external sampling) so yes please do it using the standard RL API like NFSP. (DREAM would be the more proper comparison since it's entirely sample-based).
All I'm saying by the "partition" comment is that if we concatenate s_0 and s_1, we need to partition that concatenated input again so we can compute a_0 = n_A(s_0) + n_B(s_1) and a_1 = n_B(s_0) + n_A(s_1). I'm guessing that the concat operation you referred to in "Audrunas concatenated the information state tensors (from both players' perspectives) as an input to represent the history" is the one in the definition of h1 = Concat(a_0, a_1).
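In code, the structure I have in mind looks roughly like this (a sketch of my reading of Appendix F; n_A and n_B stand for the two torso networks, and the names are mine):

```python
import numpy as np

def history_embedding(n_A, n_B, s_0, s_1):
    """Sketch of how I read the Appendix F combination.

    n_A, n_B: the two torso networks (any callables returning equal-length arrays).
    s_0, s_1: the two players' information state tensors for the same history.
    """
    a_0 = n_A(s_0) + n_B(s_1)
    a_1 = n_B(s_0) + n_A(s_1)
    return np.concatenate([a_0, a_1])  # h1 in the appendix's notation
```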
Sorry for getting too deep into the semantics of this. If this is still unclear, I can just get a first iteration done and you can take a look at the code, where it's unambiguous.
Oh!! Yes, I believe you're almost certainly right about that. :)
Confirmed by Audrunas.
Okay, I've got an implementation working that is able to bring the exploitability down from around 15 to 1-2. Some thoughts/notes/questions below.
Thanks for all the help on this. I sent an email to the address on your blog, thought I'd mention in case you no longer use that address.
# Select the behavior policy from the candidate pool greedily, by expected value under the epoch-j critic.
candidate_policies = self._get_candidate_policies()
candidate_policies_ev = [np.sum(q_values_j * candidate_policy) for candidate_policy in candidate_policies]
behavior_policy = candidate_policies[np.argmax(candidate_policies_ev)]
Hi, thank you for your interest in ARMAC. I will try to answer the questions above:
The questions conflate data collection and evaluation. The switching and the candidate pool exist for exploratory purposes, in order to generate diverse behaviours by running randomly mixed and matched past policies with different modulations (i.e. different epsilons, temperatures, etc.). This is a separate problem from policy evaluation, which is done using off-policy policy evaluation (the TB algorithm in this case, but it could also be any other).
The mean policy is trained using a cross entropy loss. Theoretically it could also be trained using an L2 loss, although I would expect less stability.
Minimizing the cross entropy loss at any given state (in the discrete case) makes the distribution of actions converge to the empirical distribution of actions taken in that state. If many policies were run in the past, it would converge to the empirical distribution of the actions they took. In the neural network case, generalization also kicks in. However, at no point would it converge to a deterministic distribution if several different policies had visited this state.
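In other words, the mean-policy loss is just a cross entropy against the cached regret-matching policy at the visited states, roughly (a sketch, not the original code):

```python
import numpy as np

def mean_policy_cross_entropy(predicted_logits, cached_policy):
    """Sketch: cross entropy between the mean-policy head and a cached pi^j(s).

    predicted_logits: [num_actions] unnormalized outputs of the mean-policy head.
    cached_policy:    [num_actions] action probabilities stored when the state was visited.
    """
    shifted = predicted_logits - np.max(predicted_logits)  # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log-softmax
    return -np.sum(cached_policy * log_probs)
```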
I do not remember, and the code is out of reach. I will try to find my notes.
It may work, but I am not sure if linear layers are expressive enough for the state representation that you are using. If you are using a one-hot representation then it is definitely enough. Otherwise the neural net should be large enough that it could 'overfit': note that training the algorithm on Leduc is overkill, as it requires no generalization, so overfitting is fine.
In theory any off-policy policy learning algorithm should be fine asymptotically.
What is the learning rate that you are using?
I would like to clarify regarding (6): Q learning is both a policy improvement and evaluation algorithm. You have to use an off-policy policy evaluation algorithm to evaluate (non-deterministic) policies produced by regret matching.
I have made a pull request; apologies if I did anything incorrectly, this is my first. https://github.com/deepmind/open_spiel/pull/949#issue-1409602058
Regarding the critic loss, I wanted to check my own understanding and document it in case others have the question in the future. The reason we cannot use Q-learning is that it is a control algorithm: it aims to make $Q^{\pi} \rightarrow Q^*$. Whereas in ARMAC we are looking only to evaluate the policy, i.e. find $Q^{\pi}$. In ARMAC, the estimated advantages $A_{\mu^t, i}(h, a) = q_{\mu^t, i}(h, a) - v_{\mu^t, i}(h)$ serve as a target for the average regret head $\bar{W}$, which is used in the regret matching equation. In order for the relationship to CFR to hold, we only want the critic to tell us about the value of the current policy without implicitly improving it, as is done in Q-learning via the argmax operation when producing the target.
@agruslys yesterday you had proposed $Q(h_t, a) = \sum_{a'} Q(h_{t+1}, a') \pi(h_{t+1}, a')$ to replace Q-learning. I'm wondering where the reward and gamma come in. Should it be $Q(h_t, a) = r_t + \gamma \sum_{a'} Q(h_{t+1}, a') \pi(h_{t+1}, a')$? @lanctot what was the name you had called this? "sar see" or SAR-C?
@wbrenton The algorithm I was referring to is called Expected Sarsa (which I've heard some RL friends call SARSE, pronounced "SAR-Sea" so that's what I had called it too). Here's a link to the paper: https://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/vanseijenadprl09.pdf (I don't see the name in that paper though! So maybe it was from an earlier one. Sorry for the confusion!)
Yes, your version is almost right. I think that is what @agruslys had in mind, and in the end I think it is indeed exactly Expected Sarsa (see equation 5 from the paper linked above).
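For reference, here is that target written out as code (a sketch; q_next holds the critic's estimates at h_{t+1} and policy_next is the policy being evaluated there):

```python
import numpy as np

def expected_sarsa_target(reward, gamma, q_next, policy_next, is_terminal):
    """Sketch of the 1-step Expected Sarsa target (Equation 5 of the linked paper).

    q_next:      [num_actions] critic estimates Q(h_{t+1}, .).
    policy_next: [num_actions] probabilities of the policy being evaluated at h_{t+1}.
    """
    bootstrap = 0.0 if is_terminal else gamma * np.dot(policy_next, q_next)
    return reward + bootstrap
```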
I am also a researcher focused on EFGs. I have read the preprint version of the paper. Is there any possibility that Lanctot could support me with the implementation of this algorithm?