google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.
Apache License 2.0

Implementation of approximate exploitability #519

Closed lilith829 closed 3 years ago

lilith829 commented 3 years ago

Hi,

Is there an implementation of approximate exploitability, like the one described in this paper: https://ala2020.vub.ac.be/papers/ALA2020_paper_43.pdf? It seems that in the example code, a full game-tree traversal is needed to calculate exploitability. Please let me know if I misunderstood anything. Thanks.

lanctot commented 3 years ago

Hi @lilith829, tagging @finbarrtimbers who has been working on this.

The code we have for that work (updated paper here, btw: https://arxiv.org/abs/2004.09677) uses a fair bit of internal code that is not easy to release.

However, I have started a basic one that makes the necessary modifications to IS-MCTS (to play against a fixed policy). It's currently only a prototype because it does everything with a table (even the learning part), so it cannot yet be used directly with neural nets in an AlphaZero-style learning loop for large games. I had not released it because there are still some finishing touches I need to do, and getting to the full version would require some work. I would be inclined to finally finish it if that would be useful to you.

The best option currently is to run DQN against a fixed policy. We don't have an example of that, but we can make one too. For now, you can look at breakthrough_dqn.py and modify it to play against fixed policies.
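In the meantime, here is a rough sketch of that kind of modification (not the eventual rl_response.py): train DQN as player 0 against a fixed, uniform-random policy controlling player 1. It assumes the TF1-based DQN in open_spiel.python.algorithms.dqn and the env.game / env.get_state accessors on rl_environment.Environment; exact constructor arguments may differ slightly from the current code.

```python
# Sketch only: DQN (player 0) learning a response to a fixed policy (player 1),
# instead of the DQN-vs-DQN self-play loop in breakthrough_dqn.py.
import numpy as np
import tensorflow.compat.v1 as tf

from open_spiel.python import policy as policy_lib
from open_spiel.python import rl_environment
from open_spiel.python.algorithms import dqn

tf.disable_v2_behavior()

env = rl_environment.Environment("kuhn_poker")
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]
fixed_policy = policy_lib.UniformRandomPolicy(env.game)  # the policy to exploit

with tf.Session() as sess:
  learner = dqn.DQN(sess, player_id=0,
                    state_representation_size=info_state_size,
                    num_actions=num_actions,
                    hidden_layers_sizes=[64, 64],
                    replay_buffer_capacity=int(1e5),
                    batch_size=32)
  sess.run(tf.global_variables_initializer())

  for _ in range(100000):
    time_step = env.reset()
    while not time_step.last():
      if time_step.observations["current_player"] == 0:
        action = learner.step(time_step).action
      else:
        # Sample the fixed policy's action from the underlying game state.
        probs = fixed_policy.action_probabilities(env.get_state)
        actions, action_probs = zip(*probs.items())
        action = np.random.choice(actions, p=action_probs)
      time_step = env.step([action])
    learner.step(time_step)  # final step so DQN records the terminal reward
```

Evaluating the trained agent's average return against the same fixed policy (in a separate loop with is_evaluation=True) gives the quantity discussed below as a lower bound on exploitability.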

lilith829 commented 3 years ago

Thank you @lanctot. Yes, some examples using IS-MCTS against a fixed policy would be helpful. For the DQN approach, do you mean replacing the random bots with a fixed policy (in eval_against_random_bots)? Is the returned value equivalent to the exploitability? Thanks.

lanctot commented 3 years ago

Thank you @lanctot. Yes, some examples using IS-MCTS against a fixed policy would be helpful.

Ok, there's a bit of work to do on this before I can release it; it might be a week or two. I'll keep you posted.

For the DQN approach, do you mean replacing the random bots with a fixed policy (in eval_against_random_bots)? Is the returned value equivalent to the exploitability? Thanks.

Yes, that is what I meant, but no, it is not equivalent to exploitability: it is only a lower bound on exploitability. But at least it is something that is possible to compute in larger games (and it is starting to be used in papers, e.g. see DREAM). I have an adaptation of this already that I can easily submit for the next update (Monday).

lilith829 commented 3 years ago

Ok, there's a bit of work to do on this before I can release it; it might be a week or two. I'll keep you posted.

Great. Thank you.

Yes, that is what I meant, but no, it is not equivalent to exploitability: it is only a lower bound on exploitability. But at least it is something that is possible to compute in larger games (and it is starting to be used in papers, e.g. see DREAM). I have an adaptation of this already that I can easily submit for the next update (Monday).

Sounds great. As an exercise, I tried to run the example that trains DQN agents against random agents. For the breakthrough game, the reward for both players approaches 1 quickly. But when I switch the game to kuhn_poker without any other changes, the reward doesn't seem to converge and doesn't increase with more episodes (even at 1M episodes). My understanding is that with more training, the rewards should increase and converge to some value, right? Do you know what the reason could be? Thanks.

lanctot commented 3 years ago

Sounds great.

I wrote this today, btw, so it will be available in the next update (Monday).

As an exercise, I tried to run the example that trains DQN agents against random agents. For the breakthrough game, the reward for both players approaches 1 quickly. But when I switch the game to kuhn_poker without any other changes, the reward doesn't seem to converge and doesn't increase with more episodes (even at 1M episodes). My understanding is that with more training, the rewards should increase and converge to some value, right? Do you know what the reason could be?

No, that's not true. In this class of games, standard RL (i.e. DQN in self-play) is not necessarily doing strict policy improvement. The optimal policy may be stochastic, and DQN is built on top of Q-learning for MDPs, whereas in Kuhn poker the Markov assumption no longer holds. So you should not expect to get monotonically better over time. There is no guarantee on how the policies will change, and there can be intransitivities (meaning there is also no guarantee that the policies will do strictly better than uniform random over time). Even if you use an algorithm that gets closer to a Nash equilibrium, there is no guarantee that it will strictly improve against uniform random, because "optimal" here means getting closer to a Nash equilibrium, not beating some specific strategy by more than before.

Some good places to start:

lilith829 commented 3 years ago

Thanks for the detailed explanation. I understand the results better now. But I still don't understand the reasoning behind using DQN.

Thank you.

finbarrtimbers commented 3 years ago

1) We use DQN here as it's a really simple agent that is inexpensive to run. It is not particularly strong. If anything, it's likely to be relatively weak.

2) It's a lower bound because other agents could do better. Another way of wording it: the exact best response is going to be at least as good as the approximate best response learned by DQN here. If it were an upper bound, we'd be saying the opposite: that the exact best response would, in some cases, be worse than the approximation learned here. That is clearly not possible, as the exact best response is by definition at least as good as any approximation.

3) You totally can, but it's possible for it to be a bad lower bound. I know that ReBeL used a variant of DQNBR (i.e. the approximate best response, or ABR, implementation that uses DQN as the BR) based on double DQN, and they got strong results. In our experiments, DQNBR has done quite well in poker when set up appropriately. We had to add some additional features to our implementation to get it to be a strong approximation; I'm not sure how Marc's implementation does in poker.

I'll answer your question in terms of "approximate exploitability", which is equal to the average reward that DQN receives when playing against a fixed policy.
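To make the lower-bound relationship precise in one place (notation is mine, not lifted from either paper), here is one hedged way to write it for a two-player zero-sum game:

```latex
% Two-player zero-sum, so the baseline u_i(\pi) terms cancel when summed.
\[
  \mathrm{Expl}(\pi) \;=\; \frac{1}{2}\sum_{i=1}^{2} \max_{\pi_i'} u_i(\pi_i', \pi_{-i}),
  \qquad
  \widehat{\mathrm{Expl}}(\pi) \;=\; \frac{1}{2}\sum_{i=1}^{2} u_i(\hat{\pi}_i, \pi_{-i}),
\]
\[
  \widehat{\mathrm{Expl}}(\pi) \;\le\; \mathrm{Expl}(\pi)
  \quad\text{since}\quad
  u_i(\hat{\pi}_i, \pi_{-i}) \;\le\; \max_{\pi_i'} u_i(\pi_i', \pi_{-i})
  \text{ for any learned response } \hat{\pi}_i.
\]
```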

If you calculate the approximate exploitability of two policies, and A has a lower approximate exploitability than B, it's not conclusive proof that A is less exploitable than B, but it is a strong indication. So as long as you are careful about what you claim, I think you could make that statement. I would take a look at what Facebook claimed in the ReBeL paper, or what I claimed in the ABR paper; if that language is unclear, let me know and I'll try to reword it, as I discuss this in depth there.

This is why we spend so much time in the ABR paper validating the ISMCTSBR agent; the strength of the agent is directly proportional to the accuracy of the approximation, so if you can prove that your agent is able to approximate the BR in a variety of settings where the BR is known, you can have confidence in how accurate the approximation is in settings where you don't know the exact BR value.

Finally, as I say in the ABR paper, there is no reason to use an approximate version of exploitability in games like Kuhn Poker, where the exact BR is tractable to compute. It is strictly better, and I would encourage you to use it. Approximate exploitability is only useful in settings like Heads Up No Limit Hold'em, where the exact BR is too expensive to compute.
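For reference, a minimal sketch of the exact computation in a small game with OpenSpiel's built-in calculator; the uniform TabularPolicy here is just a placeholder for whatever policy you actually want to evaluate.

```python
# Exact exploitability (full best-response traversal), tractable for Kuhn poker.
import pyspiel
from open_spiel.python import policy as policy_lib
from open_spiel.python.algorithms import exploitability

game = pyspiel.load_game("kuhn_poker")
pi = policy_lib.TabularPolicy(game)  # defaults to uniform over legal actions
print("exploitability:", exploitability.exploitability(game, pi))
print("nash_conv:", exploitability.nash_conv(game, pi))
```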

finbarrtimbers commented 3 years ago

I should also note that ABR is a bit of a proof by counterexample situation; if ABR manages to exploit your agent, then that's a strong indication that your agent is weak, but if ABR doesn't manage to exploit your agent, it's tough to draw any definitive conclusions.

lanctot commented 3 years ago

Finbarr's answer is really great, I just wanted to touch on a specific point:

* For a poker game, can I still use DQN for exploitability approximation, since poker is not an MDP?

When you fix the opponents' policies and only have one agent learning, then it is an MDP (the fixed policies are part of the environment), and so it's ok to use RL to look for approximate best responses, even though it's an imperfect-information game. There is more on why in Finbarr's paper. I also realize I forgot to mention PSRO, which relies heavily on this fact: https://arxiv.org/abs/1711.00832. See also, especially, Greenwald et al., "Solving for Best Responses and Equilibria in Extensive-Form Games with Reinforcement Learning Methods", where they go into a lot more detail.
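For small games, the exact counterpart of such an RL response is computable directly; a short sketch (the value below is what a DQN response trained against the same fixed policy approximates from below):

```python
# Exact best response of player 0 to a fixed opponent policy in Kuhn poker.
import pyspiel
from open_spiel.python import policy as policy_lib
from open_spiel.python.algorithms import best_response

game = pyspiel.load_game("kuhn_poker")
fixed_policy = policy_lib.UniformRandomPolicy(game)  # the opponent to respond to
br = best_response.BestResponsePolicy(game, 0, fixed_policy)  # responder: player 0
print("best-response value for player 0:", br.value(game.new_initial_state()))
```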

The problem crops up when agents are learning together (which is what you were doing in Breakthrough and Kuhn). It is not as much of a problem in practice in perfect-information games, because the quality of the policy does not depend as strongly on information you don't have (like hidden cards), and figuring out the hidden information is not as critical to playing well.

The file I've added, rl_response.py, which you will see in Monday's sync to GitHub, uses DQN to learn a response versus a fixed policy.

lilith829 commented 3 years ago

Thanks Finbarr and Marc for the explanation and clarification. I will spend some time digging into the details you shared. I am looking forward to trying rl_response.py.

When you fix the opponents' policies and only have one agent learning, then it is an MDP (the fixed policies are part of the environment), and so it's ok to use RL to look for approximate best responses, even though it's an imperfect information game.

I see. If I set the fixed policy to be the optimal policy in rl_response.py and play Kuhn poker, the DQN agent will eventually reach a Nash equilibrium, or close to it, right? When we use DQN to approximate the exploitability of a game like HUNL, it would probably be okay, but I would just get an exploitability lower than the true exploitability, right?

there is no reason to use an approximate version of exploitability in games like Kuhn Poker, where the exact BR is tractable to compute. It is strictly better, and I would encourage you to use it. Approximate exploitability is only useful in settings like Heads Up No Limit Hold'em, where the exact BR is too expensive to compute.

Thanks for the information. Indeed, I used Kuhn poker as a toy example to try out the DQN agent's strength. The initial game I used was heads-up limit poker, which has about 10^14 states, but I got an OOM error when computing exploitability. The code first converts the callable policy to a tabular policy, which is too big to fit in memory. Computing the exploitability of a game of this size should be doable as long as the tabular policy is not held in memory entirely but computed on the fly during the tree traversal. Would that work?

lanctot commented 3 years ago

Thanks Finbarr and Marc for the explanation and clarification.

I see. If I set the fixed policy to be the optimal policy in rl_response.py and play Kuhn poker, the DQN agent will eventually reach a Nash equilibrium, or close to it, right?

No, it will converge to an (approximate) best response, which is not necessarily a Nash equilibrium. To see this, think of Rock, Paper, Scissors. The equilibrium mixed strategy is 1/3, 1/3, 1/3. Rock is a best response to this, but Rock is not an equilibrium strategy.
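A tiny numeric illustration of the Rock, Paper, Scissors point (plain numpy, nothing OpenSpiel-specific):

```python
# Rock earns 0 against the uniform mixture (so it is a best response to it),
# yet pure Rock is itself exploitable (Paper wins 1 against it), so Rock is
# not an equilibrium strategy.
import numpy as np

# Row player's payoffs; rows/columns ordered (Rock, Paper, Scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
uniform = np.ones(3) / 3

print(A @ uniform)    # [0. 0. 0.] -> every action, including Rock, best-responds to uniform
print((-A[0]).max())  # 1 -> the opponent's best response to pure Rock (Paper) wins 1
```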

When we use DQN to approximate the exploitability of a game like HUNL, it would probably be okay, but I would just get an exploitability lower than the true exploitability, right?

Yes exactly, it is only a lower bound.

Thanks for the information. Indeed, I used Kuhn poker as a toy example to try out the DQN agent's strength. The initial game I used was heads-up limit poker, which has about 10^14 states, but I got an OOM error when computing exploitability. The code first converts the callable policy to a tabular policy, which is too big to fit in memory. Computing the exploitability of a game of this size should be doable as long as the tabular policy is not held in memory entirely but computed on the fly during the tree traversal. Would that work?

All of our exploitability algorithms require the full policy to be in memory. The best bet for large games is DQN in rl_response.py.

lanctot commented 3 years ago

Hi @lilith829, it is now in: https://github.com/deepmind/open_spiel/commit/9122a9bf520cfd2dcfc0ac2e8f0c63811a0b6e15

finbarrtimbers commented 3 years ago

Computing the exploitability of a game of this size should be doable as long as the tabular policy is not held in memory entirely but computed on the fly during the tree traversal. Would that work?

The tree traversal code in question (which I wrote) was optimized for speed, not memory usage, so it keeps a ton of stuff in memory. While it is technically feasible to calculate a BR in limit poker, our implementation would need drastic changes to do so. I also strongly suspect you'd have to distribute the BR calculation across multiple machines to make it happen in a reasonable amount of time, while our implementation is single-machine only.

This is part of why we made ABR, as ISMCTSBR/DQNBR scale very nicely.