google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

Is there any possibility of supporting the code for "The Advantage Regret-Matching Actor-Critic"? #597

Closed dvstter closed 3 years ago

dvstter commented 3 years ago

I am also a researcher focused on extensive-form games (EFGs). I have read the preprint version of that paper. Is there any possibility that Lanctot could support me with the implementation of this algorithm?

lanctot commented 3 years ago

I'd love nothing more than to have an implementation of ARMAC in OpenSpiel. It was also requested in the past, see here: https://github.com/deepmind/open_spiel/issues/441.

I was hoping to reimplement it as a side project to learn JAX, but I've been saying that for 6 months. I don't think I will realistically be able to get to it because I've got too many other higher-priority things to do, but I would be happy to help by looking over your code if you implemented it and submitted it as a PR.

Thanks for your interest!

dvstter commented 3 years ago


Thanks for your reply, I will try to implement it by myself.

wbrenton commented 2 years ago

@lanctot is there still interest in a JAX implementation of ARMAC for OpenSpiel?

lanctot commented 2 years ago

@wbrenton

Absolutely! But it's not an easy algorithm to implement, so beware. Actually, if you know your way around JAX I can see it not being too bad.

@dvstter submitted a PyTorch PR: https://github.com/deepmind/open_spiel/pull/706. Unfortunately Audrunas never found the time to look through it, so we never imported it.

I would be happy to re-open the idea of importing ARMAC into OpenSpiel. However, if we do, it'd be good to see exploitability graphs comparable to the paper (or at least going down over time) on the benchmark games to be somewhat convinced that it's correct.

wbrenton commented 2 years ago

Fantastic! I'm pretty comfortable in JAX but I'll certainly reach out for some help to make sure the math is sound.

Looks like there is some good info in https://github.com/deepmind/open_spiel/issues/441, glad to have that. I was curious if the seminar mentioned in https://github.com/deepmind/open_spiel/issues/441#issuecomment-884574977 ever made it to your YouTube or elsewhere. Or if you or any of the other authors have ever given talks on ARMAC since.

Did you ever peek at @dvstter's implementation? Just want to get a feel for whether it is headed in the right direction or not. I need to look more into implementing the TreeBackup aspect, but am interested in doing so in a future iteration.

Would using https://github.com/deepmind/open_spiel/blob/b7ad42fbb3815a469c802eb563b6ca59ec5cd796/open_spiel/python/algorithms/exploitability.py#L154 plotted with epochs on the x-axis be what you're looking for?

Thanks in advance!

lanctot commented 2 years ago

Yes, the seminar is here: https://www.youtube.com/watch?v=ycQqCT7FFRw. I spent less than 5 minutes on ARMAC, starting at 38:50, so it's only an overview, but it is a good place to start.

I did not look deeply at @dvstter's implementation because I thought Audrunas would and then I got super busy, but if there's renewed interest, I would be happy to help in both cases, as it would be really nice to finally have ARMAC in the code base. As I told them, I think you can start with DQN for the off-policy policy evaluation rather than TreeBackup.

Yes, NashConv plotted on a graph with the x-axis as the iterations is what I had in mind (like the graphs in the paper).
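
For reference, a minimal sketch of how such a curve could be produced with OpenSpiel's exploitability module; train_one_iteration is a hypothetical placeholder for one ARMAC training step that returns the current average policy as an open_spiel.python.policy.Policy:

import pyspiel
from open_spiel.python.algorithms import exploitability

game = pyspiel.load_game("leduc_poker")
num_iterations = 100
nash_convs = []
for _ in range(num_iterations):
    # Hypothetical ARMAC training step (not an existing function) returning
    # the current average policy as an open_spiel.python.policy.Policy.
    avg_policy = train_one_iteration()
    nash_convs.append(exploitability.nash_conv(game, avg_policy))
# Plot nash_convs against range(num_iterations) to get the graph described above.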

wbrenton commented 2 years ago

@lanctot could you clarify something for me?

Do we use $\overrightarrow{r}$ (that is cached where player $i$ acts) as our $\bar{W}$ target?

In the text to the left of the pseudocode it says that $\bar{W}$ estimates $A_{\mu}$, whereas in the pseudocode the subscript $\mu$ is missing (train $\bar{W}_{\theta^t}$ to predict $A(h,a)$).

Based on https://github.com/deepmind/open_spiel/issues/441#issuecomment-918476921 I'm guessing we do use $\overrightarrow{r}$ as the target.

At the same time, I'm having trouble reconciling $\overrightarrow{r}$ being calculated with $\theta^j$ when (as I mentioned above) the text to the left of the pseudocode describes $\bar{W}$ as estimating the expected advantage calculated with the behavior params $\mu$.

edit: I'll add... if we do use $\overrightarrow{r}$, what is the purpose of caching the behavior policy?

lanctot commented 2 years ago

Hi @wbrenton,

These are great questions... please let me know if anything was unclear!

wbrenton commented 2 years ago

Very helpful thank you!

wbrenton commented 2 years ago

Could you help me understand what the bolded "a batch of 64 trajectories of length 32" looks like for Leduc poker? The longest action sequence, to my understanding, is (P1: C, P2: R, P1: R, P2: C or F). If you're concatenating episodes to get length 32, was there any masking done between episodes?

I'm writing this initial version for Leduc, then I'll iterate and expand the implementation to all applicable games.

lanctot commented 2 years ago

I believe the next two sentences answer how Audrunas did it:

> In order to implement a full recall, all unfinished episodes will be continued on the next training iteration by propagating recurrent network states forward. Each time when one episode finishes at a particular batch entry, a new one is sampled and started to be unrolled from the beginning.

So I don't think it even did any resetting of LSTM state between episodes, and the only thing different about the end of an episode is that there's a reward contained there. In most RL environments, the gamma is set to 0 when you hit the end of an episode to signify that the algorithm doesn't bootstrap across episode boundaries; I'm not sure if he did that, but I'll check.

lanctot commented 2 years ago

Just asked Audrunas: turns out he did both reset the LSTM at episode boundaries and set gamma = 0 for those terminal states (to prevent bootstrapping returns into the next episode).
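
As an aside, the gamma = 0 trick usually just amounts to masking the bootstrap term at episode boundaries; a minimal sketch with illustrative names (not from the original code):

def td_target(reward, next_value, gamma, is_terminal):
    # Zero the bootstrap term at a terminal state so returns are not
    # propagated across episode boundaries within a concatenated batch entry.
    return reward + (0.0 if is_terminal else gamma) * next_value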

I might suggest using a feed-forward net at first while checking correctness.

BTW one other thing, when training the "history-based critics", Audrunas concatenated the information state tensors (from both players' perspectives) as an input to represent the history.
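
In the Python API, that concatenation could look roughly like this (just a sketch, not Audrunas's code; the walk through the chance nodes is only to reach a decision node in Leduc):

import numpy as np
import pyspiel

game = pyspiel.load_game("leduc_poker")
state = game.new_initial_state()
while state.is_chance_node():
    # Resolve the card deals arbitrarily, just to reach a decision node.
    action, _ = state.chance_outcomes()[0]
    state.apply_action(action)

# Concatenate both players' information state tensors for the same history h
# to form the input of the history-based critic.
history_input = np.concatenate([
    np.asarray(state.information_state_tensor(0)),
    np.asarray(state.information_state_tensor(1)),
])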

wbrenton commented 2 years ago

Ok, thank you for that info. Question regarding that last part: what are s_0 and s_1 in paragraph 4 of Appendix F, then? I had interpreted them as each player's infostate. If that's correct, why concatenate them and then immediately partition them again? Or perhaps h1 is the concatenation operation you're referring to? Am I missing something here?

Also, the other thread on ARMAC mentioned using a traverse-game-tree method to recursively generate episodes (similar to the Deep CFR implementation). Being that this is an RL algorithm, I'm using the RL API, similar to the NFSP implementation. Just thought I'd double check on that.

lanctot commented 2 years ago

> Ok, thank you for that info. Question regarding that last part: what are s_0 and s_1 in paragraph 4 of Appendix F, then? I had interpreted them as each player's infostate. If that's correct, why concatenate them and then immediately partition them again? Or perhaps h1 is the concatenation operation you're referring to? Am I missing something here?

That's correct. s_0 and s_1 are the info states from the perspective of the two different players, for the same history h. (These have been called "augmented information states" in the past.) In OpenSpiel, you can get them via State::InformationStateTensor(Player player).

I do not understand what you mean when you say "immediately partition them again"; can you elaborate? Maybe you mean why q_0 and q_1 are different outputs? That's because they represent different values (one to player 0 and the other to player 1). Am I answering the right question?

> Also, the other thread on ARMAC mentioned using a traverse-game-tree method to recursively generate episodes (similar to the Deep CFR implementation). Being that this is an RL algorithm, I'm using the RL API, similar to the NFSP implementation. Just thought I'd double check on that.

That seems weird to me...? We quite strongly motivated ARMAC as a model-free method, which distinguishes it from standard Deep CFR (which uses external sampling), so yes, please do it using the standard RL API like NFSP. (DREAM would be the more proper comparison since it's entirely sample-based.)
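
For reference, a bare-bones episode loop with the standard RL API (the same pattern as the NFSP example) could look roughly like this; the agents list is just an assumed stand-in for agents implementing the rl_agent step interface:

from open_spiel.python import rl_environment

env = rl_environment.Environment("leduc_poker")
time_step = env.reset()
while not time_step.last():
    player_id = time_step.observations["current_player"]
    agent_output = agents[player_id].step(time_step)  # agents: assumed list of RL agents
    time_step = env.step([agent_output.action])
# Let every agent observe the terminal time step (rewards only, no action taken).
for agent in agents:
    agent.step(time_step)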

wbrenton commented 2 years ago

All I'm saying by the "partition" comment is that if we concatenate s_0 and s_1, we need to partition that concatenation again so we can compute a_0 = n_A(s_0) + n_B(s_1) and a_1 = n_B(s_0) + n_A(s_1). I'm guessing that the concat operation you referred to in "Audrunas concatenated the information state tensors (from both players' perspectives) as an input to represent the history" is the definition of h1 = Concat(a_0, a_1).

Sorry for getting too deep into the semantics of this. If this is still unclear, I can just get a first iteration done and you can take a look at the code, where it's unambiguous.
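
If it helps, here is how I read that structure as a tiny sketch (n_A, n_B and the inputs are placeholders, not code from the paper):

import numpy as np

def history_embedding(s0, s1, n_A, n_B):
    # a_0 = n_A(s_0) + n_B(s_1), a_1 = n_B(s_0) + n_A(s_1), h1 = Concat(a_0, a_1)
    a0 = n_A(s0) + n_B(s1)
    a1 = n_B(s0) + n_A(s1)
    return np.concatenate([a0, a1])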

lanctot commented 2 years ago

Oh!! Yes, I believe you're almost certainly right about that. :)

lanctot commented 2 years ago

> Oh!! Yes, I believe you're almost certainly right about that. :)

Confirmed by Audrunas.

wbrenton commented 2 years ago

Okay, I've got an implementation working and am able to bring the exploitability down from around 15 to the 1-2 range. Some thoughts/notes/questions below.

  1. I will need some guidance on evaluating the "candidate policies" used for behavior/exploration. I understand the construction of the candidate pool fine. Where I'm uncertain is the "dynamic switching between policies", the process by which we evaluate the candidates. The paper references evaluating based on estimated return. My best guess is that you select the policy with the greatest expected value, something like the code block at the bottom. Where that idea falls apart (at least in terms of how to implement it) is how to include the different epsilon values in the EV calculation, if the different levels of epsilon are part of the evaluation process at all. I could be off here altogether.
  2. Unless I'm missing it, I don't see any details on the loss used for the mean policy head other than that it's "also trained via a classification loss". What throws me is the "also" in that quote; it seems to be referring to the $\bar{W}$ head, which is trained with an $l_2$ (regression) loss. I feel this is an important detail, as the mean policy is what converges to Nash. Here I use cross entropy.
  3. Assuming you do indeed use something like cross entropy, I'm curious to understand the intuition as to why. When I think of minimally exploitable strategies in poker, you want a healthy mix of bluffing vs. folding, or chasing a flush/straight draw vs. folding (some stochasticity in your strategy). If we use a classification loss that tries to reduce the entropy of our distribution over actions, we train the network to assign a high probability to one action (the target) and implicitly assign low probability to all other actions. Doesn't this encourage the strategy to be deterministic? Why don't we want to learn the exact distribution produced by regret matching? Should I be using something like KL? What loss did Audrunas use?
  4. How many episodes are generated before breaking and beginning the training phase of each epoch? This would be $K_{act}$ in the pseudocode.
  5. This implementation is for Leduc; since it is a smaller game I'm only using linear layers.
  6. The critic, as you suggested, is using q-learning. I'm curious what effect this has on the minimum achievable exploitability relative to using Tree-Backup.
  7. In the experiments I've done so far, there's an unsettling level of variance in the exploitability convergence from trial to trial. I'm guessing this is due to a combination of not dynamically switching between exploratory policies and the randomness of the trajectories added to the replay buffer in the beginning. Nonetheless, I have been able to reproduce getting into the 1-2 range.

Thanks for all the help on this. I sent an email to the address on your blog; thought I'd mention it in case you no longer use that address.

# Select the behavior policy among the candidates by estimated expected value
# under the current q-values (q_values_j).
candidate_policies = self._get_candidate_policies()
candidate_policies_ev = [np.sum(q_values_j * candidate_policy) for candidate_policy in candidate_policies]
behavior_policy = candidate_policies[np.argmax(candidate_policies_ev)]

agruslys commented 2 years ago

Hi, thank you for your interest in ARMAC. I will try to answer the questions above:

  1. The question conflates data collection and evaluation. The switching and the candidate pool exist for exploratory purposes, in order to generate diverse behaviours by running randomly mixed and matched past policies with different modulations (i.e. different epsilons, temperatures, etc.). This is a separate problem from policy evaluation, which is done using off-policy policy evaluation (the TB algorithm in this case, but it could also be any other).

  2. The mean policy is trained using a cross-entropy loss. Theoretically it could also be trained using an L2 loss, although I would expect less stability.

  3. Minimizing the cross-entropy loss at any given state (in the discrete case) makes the distribution over actions converge to the empirical distribution of actions taken in that state. If many policies were run in the past, it converges to the empirical distribution of actions taken across those policies. In the neural network case, generalization also kicks in. However, at no point would it converge to a deterministic distribution if several different policies had visited this state (a toy sketch illustrating this follows after this list).

  4. I do not remember, and the code is out of reach. I will try to find my notes.

  5. It may work, but I am not sure if linear layers are expressive enough for the state representation that you are using. If you are using a one-hot representation then it is definitely enough. Otherwise the neural net should be large enough that it can 'overfit': note that training the algorithm on Leduc is overkill, as it requires no generalization, so overfitting is fine.

  6. In theory any off-policy policy learning algorithm should be fine asymptotically.

  7. What is the learning rate that you are using?
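
A toy sketch of point 3 (illustrative numbers only, not code from the original implementation): minimizing cross entropy on sampled actions at a single state drives a softmax head toward the empirical action distribution, not toward a one-hot policy.

import numpy as np

rng = np.random.default_rng(0)
true_mix = np.array([0.5, 0.3, 0.2])            # mixed behaviour at this state
actions = rng.choice(3, size=5000, p=true_mix)  # actions logged by past policies
logits = np.zeros(3)
for a in actions:                               # SGD on the cross-entropy loss
    probs = np.exp(logits) / np.exp(logits).sum()
    logits -= 0.05 * (probs - np.eye(3)[a])     # gradient of CE w.r.t. the logits
print(np.exp(logits) / np.exp(logits).sum())    # close to the empirical mix, not one-hot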

agruslys commented 2 years ago

I would like to clarify regarding (6): Q learning is both a policy improvement and evaluation algorithm. You have to use an off-policy policy evaluation algorithm to evaluate (non-deterministic) policies produced by regret matching.

wbrenton commented 2 years ago

I have made a pull request; apologies if I did anything incorrectly, this is my first. https://github.com/deepmind/open_spiel/pull/949#issue-1409602058

Regarding the critic loss, I wanted to check my own understanding and document it in case others have the same question in the future. The reason we cannot use q-learning is that it is a control algorithm: it aims to drive $Q^{\pi} \rightarrow Q^*$, whereas in ARMAC we are looking to only evaluate the policy, i.e. find $Q^{\pi}$. In ARMAC, the estimated advantages $A_{\mu^t, i}(h, a) = q_{\mu^t, i}(h, a) - v_{\mu^t, i}(h)$ serve as targets for the average regret head $\bar{W}$, which is used in the regret-matching equation. In order for the relationship to CFR to hold, we only want the critic to tell us about the value of the current policy without implicitly improving it, as is done in q-learning with the argmax operation when producing the target.

@agruslys yesterday you had proposed $Q(h_t, a) = \sum_{a'} Q(h_{t+1}, a') \pi(h_{t+1}, a')$ to replace q-learning. I'm wondering where the reward and gamma come in. Should it be $Q(h_t, a) = r_t + \gamma \sum_{a'} Q(h_{t+1}, a') \pi(h_{t+1}, a')$? @lanctot what was the name you had called this? "sar see" or SAR-C?

lanctot commented 2 years ago

@wbrenton The algorithm I was referring to is called Expected Sarsa (which I've heard some RL friends call SARSE, pronounced "SAR-Sea" so that's what I had called it too). Here's a link to the paper: https://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/vanseijenadprl09.pdf (I don't see the name in that paper though! So maybe it was from an earlier one. Sorry for the confusion!)

Yes, your version is almost right. I think that is what @agruslys had in mind, and in the end I think it is indeed exactly Expected Sarsa (see equation 5 in the paper linked above).
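
For concreteness, here is a small sketch of the Expected Sarsa evaluation target next to the Q-learning (control) target (names are illustrative, not from an existing implementation):

import numpy as np

def expected_sarsa_target(reward, gamma, q_next, pi_next, is_terminal):
    # Bootstrap on the expected value under the current (regret-matching)
    # policy, so the critic evaluates the policy rather than improving it.
    bootstrap = 0.0 if is_terminal else gamma * np.dot(pi_next, q_next)
    return reward + bootstrap

def q_learning_target(reward, gamma, q_next, is_terminal):
    # The max operator makes this a control target, which is what ARMAC's
    # critic should avoid.
    bootstrap = 0.0 if is_terminal else gamma * np.max(q_next)
    return reward + bootstrap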