google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

Question about algorithms for simultaneous move games (goofspiel in particular) #270

Closed andrschl closed 4 years ago

andrschl commented 4 years ago

Hi all!

In the course of a small project we are currently trying to find some nice algorithms for goofspiel. However, from what I've seen so far, algorithms that would need an LP solver for the simultaneous case, such as (star-)minimax, MCTS (https://dke.maastrichtuniversity.nl/m.winands/documents/wcg13-smmcts.pdf), value iteration, and Q-learning, are currently not implemented. Since we don't want to end up with pure strategies, converting the game to an imperfect-information sequential game also does not seem to make sense for those algorithms.

So shall we just focus on policy gradient methods in this case? Or does anyone have some experience with goofspiel (or similar simultaneous games) and knows which algorithms generally work well?

Thanks in advance for your suggestions and sorry if this is the wrong place for such a question.

Best, Andreas

lanctot commented 4 years ago

Hi @andrschl , the proper value iteration for simultaneous move games is implemented in Python, just not in C++. The reason at the time was that I did not want to add a dependency on glpk, but now we have a nice way to do conditional dependencies, so we can revisit this. I would also like to have MinimaxQ in the code base, which would be a simple extension. It is some work, but it would be a very welcome contribution!

If you go this route, I would like the LP-solving code (any wrappers around glpk) to be self-contained so we could use it for other stuff (like we basically did on the Python side).
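To make the LP part concrete, here is a minimal sketch of the building block being discussed: solving the zero-sum matrix game at a single state, which is what simultaneous-move value iteration (and Minimax-Q) needs at every step. This uses scipy rather than glpk, and `solve_matrix_game` is an illustrative name, not an existing OpenSpiel function.

```python
# A minimal sketch: solve a zero-sum matrix game with an LP, the building
# block needed for simultaneous-move value iteration / Minimax-Q.
# Uses scipy instead of glpk; the function name here is illustrative only.
import numpy as np
from scipy.optimize import linprog


def solve_matrix_game(payoffs):
    """Returns (value, row_strategy) for the row player of a zero-sum game.

    `payoffs[i][j]` is the row player's payoff when the row player plays i
    and the column player plays j.
    """
    payoffs = np.asarray(payoffs, dtype=float)
    num_rows, num_cols = payoffs.shape
    # Variables: x_0..x_{m-1} (row strategy) and v (game value).
    # Maximize v  <=>  minimize -v.
    c = np.zeros(num_rows + 1)
    c[-1] = -1.0
    # For every opponent column j:  sum_i x_i * payoffs[i, j] >= v
    #   <=>  -payoffs[:, j] . x + v <= 0
    a_ub = np.hstack([-payoffs.T, np.ones((num_cols, 1))])
    b_ub = np.zeros(num_cols)
    # The row strategy's probabilities sum to one (v unconstrained here).
    a_eq = np.ones((1, num_rows + 1))
    a_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * num_rows + [(None, None)]
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]


# Example: matching pennies has value 0 and a uniform mixed strategy.
value, strategy = solve_matrix_game([[1, -1], [-1, 1]])
```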

If you don't go that route and are mostly concerned about making sure the agent uses mixed strategies, you could add the option of using Boltzmann exploration over Q-values instead of epsilon-greedy to Q-learning (another welcome contribution, btw!). Or I would use the turn_based_simultaneous_game wrapper, treat it like an imperfect information game, and use any of NFSP / QPG / RPG, and also DeepCFR after the reservoir buffer PR is merged (this week sometime).
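For reference, a rough sketch of what Boltzmann (softmax) exploration over Q-values would look like; the `q_values`, `legal_actions`, and `temperature` names are illustrative, not the existing Q-learning agent's API.

```python
# A rough sketch of Boltzmann (softmax) exploration over Q-values, as an
# alternative to epsilon-greedy action selection. Names are illustrative.
import numpy as np


def boltzmann_action(q_values, legal_actions, temperature=1.0):
    """Samples an action with probability proportional to exp(Q / temperature)."""
    qs = np.array([q_values[a] for a in legal_actions], dtype=float)
    # Subtract the max for numerical stability before exponentiating.
    logits = (qs - qs.max()) / temperature
    probs = np.exp(logits)
    probs /= probs.sum()
    return np.random.choice(legal_actions, p=probs)
```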

WDYT?

andrschl commented 4 years ago

Hi @lanctot , thanks a lot for the quick answer and the suggestions! In that case we will try value iteration for a small number of cards first. We are also going to look at NFSP, RPG, and QPG.

If time permits, I would also be very interested in implementing MinimaxQ. So, if I understand correctly, you would prefer to have this in C++?

lanctot commented 4 years ago

Oh no, I don't have a preference there, do whichever you prefer. It would be a lot easier in Python since you don't have to reimplement the wrapper to the LP solvers, but if you plan to do that already anyway, then it might be an easy extension of what you are doing.

If you do it in Python, it would be cool if it could support both function approximation and tabular Q-values; AFAIK that's never been done. And "Deep Minimax-Q" is a pretty awesome name for an algorithm, after all... :-p But we can also add that later.
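For the tabular case, the core of Minimax-Q (Littman, 1994) is a one-line variant of the Q-learning update where the bootstrap value is the value of the stage matrix game at the next state. A rough sketch, reusing the `solve_matrix_game` helper sketched earlier in this thread; all names are illustrative.

```python
# A rough sketch of the tabular Minimax-Q update (Littman, 1994) for a
# two-player zero-sum simultaneous-move game. `solve_matrix_game` is the
# LP sketch from earlier in this thread; names here are illustrative.
import numpy as np


def minimax_q_update(q, state, my_action, opp_action, reward, next_state,
                     num_actions, alpha=0.1, gamma=1.0, terminal=False):
    """Moves q[state][my_action, opp_action] toward reward + gamma * V(next_state).

    `q` maps a state key to a (num_actions x num_actions) payoff table for
    the learning player (rows: own actions, columns: opponent actions).
    """
    if terminal:
        next_value = 0.0
    else:
        next_q = q.setdefault(next_state, np.zeros((num_actions, num_actions)))
        # V(s') is the value of the stage matrix game defined by Q(s', ., .).
        next_value, _ = solve_matrix_game(next_q)
    table = q.setdefault(state, np.zeros((num_actions, num_actions)))
    target = reward + gamma * next_value
    table[my_action, opp_action] += alpha * (target - table[my_action, opp_action])
```

The "deep" version would replace the tabular lookup with a network predicting the Q matrix for a state, while keeping the same LP-based bootstrap target.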

andrschl commented 4 years ago

Ah ok, then we'll stick with Python for the moment. Deep Minimax-Q indeed sounds really cool :) I would like to give it a try in the second half of the week.

By the way, is there an easy way to play with an RL agent (RPG, QPG) that is trained using the turn_based_simultaneous_game wrapper against an agent that uses the simultaneous node representation (e.g. the player from value iteration)?

lanctot commented 4 years ago

By the way, is there an easy way to play with an RL agent (RPG, QPG) that is trained using the turn_based_simultaneous_game wrapper against an agent that uses the simultaneous node representation (e.g. the player from value iteration)?

Ah there is nothing implemented, but it's a super easy change (that I would encourage you to contribute independently if you want).

The information state tensors (and observation tensors) of the wrapped game add a fixed number of bits to the game's underlying tensors. See e.g. here: https://github.com/deepmind/open_spiel/blob/351e157c9804ae459cb5db37f345162603249070/open_spiel/game_transforms/turn_based_simultaneous_game.cc#L172

What you should do is refactor this to call a separate function, say ConvertTensor, that is not a member function, passing in the info it needs (whose "turn" it is and who the observer is). Then this method and ObservationTensor would both call it instead. You can also expose the function to Python via pybind11 and then just call it to convert the observation before you do the lookup into your NFSP/RPG/QPG learned policies.

We should make that change, but if you want to get to the other stuff, feel free to just do that converter on the Python side instead; it's a lot easier. If you do the suggestion above, please consider submitting a PR for it, as you surely won't be the only one to ever want this :) If you don't, I'll do it eventually.
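A very rough Python-side sketch of that converter, assuming (per the description above) that the wrapped game's tensor is the underlying game's tensor plus a fixed-size block of extra turn/action bits; the exact layout of that block should be checked against turn_based_simultaneous_game.cc, and all names here are illustrative.

```python
# A rough python-side converter sketch. Assumption: the wrapped game's
# tensor is the underlying (simultaneous) game's tensor followed by a
# fixed-size block of extra bits added by the wrapper. Verify the actual
# layout in turn_based_simultaneous_game.cc before relying on this.
import numpy as np


def convert_observation(sim_tensor, wrapped_tensor_size, extra_bits=None):
    """Maps a simultaneous-node observation to the wrapped game's tensor shape.

    `sim_tensor` is the underlying game's observation tensor; `extra_bits`
    is the fixed block the wrapper adds (e.g. whose "turn" it is in the
    rollout). When omitted, the block is filled with zeros.
    """
    sim_tensor = np.asarray(sim_tensor, dtype=float)
    pad_size = wrapped_tensor_size - sim_tensor.size
    if extra_bits is None:
        extra_bits = np.zeros(pad_size)
    return np.concatenate([sim_tensor, np.asarray(extra_bits, dtype=float)])
```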

andrschl commented 4 years ago

Thanks a lot for this and sorry for my late answer. For time reasons, we did it quickly on the Python side. However, I have started on (deep) Minimax-Q and plan to submit a PR for it as soon as it works (sometime next week, hopefully) :)

lanctot commented 4 years ago

Great. Turns out I might add LP solving support on the C++ side myself soon, some time in the next month or so. I will keep you posted.

lanctot commented 4 years ago

Hi @andrschl I feel this is mostly resolved so I will close this now. Please feel free to reopen if you have more questions, and I will follow-up when I have added C++ LP code.

andrschl commented 4 years ago

Hi @lanctot, sorry for my late answer! So far I have a first implementation of tabular Minimax-Q. However, training didn't seem to be working properly, and then the holidays came in between :) As soon as time permits I will review the code, and if I manage to make it train properly I could submit a PR, if this is still of interest.

lanctot commented 4 years ago

It is certainly still of interest! I also have LPs on the C++ side planned, but it keeps getting bumped :)

lanctot commented 4 years ago

But there is no rush, and no expectations about time, so whenever you manage that is great. Would really love to see MinimaxQ in the algorithms, it's classic multiagent RL!

andrschl commented 4 years ago

Ok great. In this case I'll stay on it!