alphaZero implementation ideas and passing state to rl_agent

danielwillemsen commented 5 years ago

First of all, thanks for making this framework open source.

I’m investigating the possibility of making a (simplified) alphaZero implementation using openspiel, and I was looking for some implementation ideas, especially since you already mention this in the contributors guide.

Please note: I am not sure if I will have the time to make the code up to openspiel standards. Also I might not very closely follow the alphazero pseudo-code. Thus, I am uncertain sure whether this effort will eventually result in a pull request. I still think some pointers would be very helpful, since others might be going to work on similar algorithms.

Implementation wise, it seems most logical to me to create an _rlagent implementation called alphaZero. When taking a step however; the agent will perform a MCTS. To do this, a complete game state will have to be reconstructed. The easiest way to do this would be to pass the current environment state as an argument to the step() function of the _rlagent and then creating a game with this state internally in the _rlagent. This feels hacky to me: instead of using the _timestep argument, which was seemingly designed to provide all available information to the agent, you are feeding it extra with the full game state (of course, in a perfect information game this would be available already anyways).

What would be your perspective on this topic? Of course, any other design advice on the implementation would be very welcome as well.

tl;dr: How to do state reconstruction in rl_agent? Do you have design advice on the implementation of an alphaZero-like algorithm?

elkhrt commented 5 years ago

Thanks for your interest! You're right that the AlphaZero algorithm doesn't conform to the rl_agent API, since it makes use of a simulator for the game. It would be more natural to implement an AlphaZero actor as a bot (see python/bots or spiel_bots.h) with a separate learning process.

Since you mention perfect information, it might be worth pointing out that the AlphaZero algorithm is only designed for games of perfect information - it won't learn low-exploitability stochastic policies, and it doesn't have memory.

Happy to discuss further!

jblespiau commented 5 years ago

Thanks!

I just wanted to point out a few links, in case you may not be aware of them. There is the https://github.com/leela-zero/leela-zero project which reproduced the AlphaZero results on Go (One of their challenges was to find the compute power to generate enough games, and they crowd-sourced it, pretty impressive).

For a smaller scale, simpler version, I do not looked at them, but a rapid search showed there are already a few tutorials/implementations out there: https://github.com/suragnair/alpha-zero-general (it happens I could speak with one of the authors. Even though it is named AlphaZero, it is an AlphaGo implementation, for a course university project.) https://towardsdatascience.com/alphazero-implementation-and-tutorial-f4324d65fdfc https://towardsdatascience.com/from-scratch-implementation-of-alphazero-for-connect4-f73d4554002a (pytorch only I think)

So I imagine that depending on you goal (learning more about AlphaZero, implementing it from scratch, getting something that works on small games, including it in OpenSpiel), the process can be different, but this may be helpful if you are looking to learn more about it. (If an implementation is derived from any of outside project, we must take care that the licence allow us to get a modified version in).

lanctot commented 5 years ago

Sorry I'm late replying on this, thanks for your interest.

I think ideally what we'd like is a very simple implementation -- maybe enough to validate it on Connect Four, by reproducing the work showcased in this series of blog posts: https://medium.com/@sleepsonthefloor/azfour-a-connect-four-webapp-powered-by-the-alphazero-algorithm-d0c82d6f3ae9

It's a rather large undertaking -- at least a course project's worth of work. I think a first cut done all in python with absolutely no bells or whistles would be fine (just enough to convince ourselves that it is indeed correct). We could, e.g., optimize this by dropping down to C++ for the search + inference later. It's also nice if the reference implementation is very bare, because we would want it to be a stepping stone to build and learn from, so the simpler the better.

This is why I like the idea of just following exactly what was done in the blog post above, because (hopefully) the hyperparameters and architecture could be copied and there should be no time lost to not being sure whether it's the wrong set of hyper-parameters versus whether there is a bug.

All this said: it doesn't have to be one person that does everything! Feel free to start building the scaffolding, and either we will help or guide building on top of it. Maybe we could even start that scaffolding (which would be similar to the pseudo-code in the paper) and encourage a team effort where people fill out the bits and pieces.

I should mention: an implementation of AlphaZero is the top-requested feature (I have at least 4 requests for it so far), so it would be really great if we discussed a plan to make this happen. It would be great to loop in the AlphaZero team members, as I am sure they would have excellent advice on this too!

lanctot commented 5 years ago

@danielwillemsen I just noticed you have separate implementation of AlphaZero for Connect Four that you're already working on(!).

Briefly looking through the code, it's very similar to the "basic"/"vanilla" AlphaZero that I was thinking about and suggesting above.

How well is it working? Were you aware of the blog post series I linked above?

From what I can tell, it would not be too difficult to do something very similar within OpenSpiel... though I admit I took a very quick look. What do you think?

danielwillemsen commented 5 years ago

Hi all,

thanks for all the replies, sorry for the late reply from my side. In general, as @lanctot noticed, I was already working on a simple alphaZero implementation. The main goal for this is to create a simplified implementation which would be easy to modify, both in terms of algorithm as well as in terms of the game, to try some experiments with. When OpenSpiel got released, this opened up the possibility to use openspiel game implementations. This reliefs me from having to implement different games. In addition, openspiel's bots provide some easy baselines to compare the algorithm against.

So over the past week, I have spent some time to convert my alphaZero implementation into an openspiel compatible one (currently on a separate branch in my https://github.com/danielwillemsen/alphazero-connect4 repo). Since the the alphaZero implementation itself was not yet finished, there is still quite a bit of work to be done before I can give any comments on its performance, but the basic idea seems to be working, edit for clarity: "basic idea seems to be working" meaning: quickly learns to outperform random bots and after some training also weak MCTS bots. Further evaluation has not been done yet.

@jblespiau Thanks for the many links! Some of them I had already taken a good look at before, but others were new to me, thanks a lot!

@locked-deepmind Thanks for the help. I agree that a bot seems to be the most logical way to implement an "alphazero-player", with a seperate learning script, which is also how I am currently doing it.

@lanctot Thanks for sharing the blog, I indeed have looked at those before, and my implementation looks relatively similar (their implementation does not provide code though, thus it does leave some guesswork), although it misses many of the performance optimizations they have done. I might still try to do some of them, depending on the amount of work/expected improvements. I agree that starting with their hyper-parameters is a good idea.

Some context of my stance in this implementation and my future goals for it: This started out mostly as a project to do just for fun, but as of September I have started an internship on the topic at "Centrum Wiskunde & Informatica". As I stated before, my main goal is to have an easy to work with implementation that we could use to do small experiments and some research on.

Thus, a good working implementation is important to me, however; I am unsure how much time I will be able to spend myself on making the implementation more optimized once it is "finished", and I will have to discuss these priorities with my colleagues.

Once more, thanks for all your help!

ronvree commented 5 years ago

Hello!

A while ago I also created my own AlphaZero implementation. Feel free to look at it at https://github.com/ronvree/AlphaZero. Hope it can be of any help!

About the implementation:

So far I have been able to train a Connect 4 agent that works pretty well
It should be easy to extend it to other games, models and frameworks with little effort (so far I mostly used pytorch)

If you have any questions regarding the implementation, don't hesitate to ask!

lanctot commented 5 years ago

@ronvree thanks!

@danielwillemsen : great! At this point, I am not so interested in the optimizations or performance as much as I am interested in just being confident that it is (i) simple, and (ii) correct. We or the community can improve it later, but would would be great is something we know is working. And you would not necessarily be the one to have to do it (we would not expect that from you). The pointing to the blog post was mainly as a way to help ensuring the correctness (maybe also comparing with @ronvree's implementation would help here too).

We can also help with any questions of course, don't hesitate to ask.

As a side note, BTW, I knew a few people at CWI (I did my post-doc at Maastricht University): Michael Kaisers and Hendrik Baeir. It would be really cool if you could use it for research too!

danielwillemsen commented 5 years ago

@ronvree thanks for sharing!

@lanctot The world is such a small place. Hendrik Baier and Michael Kaisers happen to be my internship supervisors. So I have a lot of really great help and advice here already. However; if I have any specific questions about OpenSpiel, I will be sure to reach out to you.

In the meantime, I will keep you guys updated on my progress. Thanks again for your help!

sbodenstein commented 5 years ago

@danielwillemsen: I wanted to also tackle this project, and saw that you've started. What is your status? Do you still plan to integrate your work into OpenSpiel?

danielwillemsen commented 4 years ago

Hi, thanks for your interest!

My implementation is currently still on a separate repository. In short, the implementation works with openspiel and I have been using it myself quite a bit.

It can be used for different games in openspiel, I have tested it for connect-four and breakthrough (6x6).
It uses pytorch (and currently a resnet architecture is implemented)
It plays multiple games in parallel, with the possibility of using multiple GPU's. Although it does not reach a very high GPU utilization (my python MCTS implementation is fairly slow).

The main downside is that the parallelization has made the code quite a bit more complicated and messy. In general, the whole code quality is poor.

I am still hoping to find some time to create a clean and stripped down version of it, which I would still like to integrate into openspiel. Realistically, I won't be able to finish this on my own for another month or two at least, as I am currently focused on my research, but I could find some time to work on it.

If you would be interested in working on this together, I can put a minimalistic version without all the parallelization in an openspiel fork, and we could work from there? Let me know what you think, and feel free to ask for any more information.

Cheers!

sbodenstein commented 4 years ago

Thanks for giving such a detailed update! One option I'm considering is focusing on the Swift implementation. I think that an Alpha Zero port would be a great showcase for the virtues of Swift for TensorFlow (eg. being able to write an efficient MCTS implementation directly in Swift). But would be more than happy to also help out on the Python version, and ensure the two implementations are consistent.

saeta commented 4 years ago

@sbodenstein if you run into issues or would like help debugging something with a S4TF implementation, please don't hesitate to reach out! :-)

taliesinb commented 4 years ago

@sbodenstein and I are planning on spending this coming week on this, we will report back here soon!

sbodenstein commented 4 years ago

An update from me: I should have a working Python AlphaZero implementation in the next few days that integrates cleanly into OpenSpiel and maximally reuses existing functionality (eg MCTSBot).

I'm getting a TicTacToe implementation working first. Perhaps @danielwillemsen we can work together to get it to work on Connect4 as well.

taliesinb commented 4 years ago

A quick update from me: I have an MCTS implementation in Swift. It is not yet tested, however. Will try to finish this off before Christmas, and at least sketch out how we can extend it to support AlphaZero.

sbodenstein commented 4 years ago

Just an update on my side: have gotten back to this today after the holidays, and have a Python AlphaZero implementation mostly ready (https://github.com/Aule-AI/open_spiel/tree/python_alpha_zero). Will make a PR and start a design discussion in the next day or two.

sbodenstein commented 4 years ago

Sorry, had less time than expected and only made the PR now (#134). I also haven't had time to try train on Connect4. The article referenced above makes me worried:

But soon after starting, we realized that even a simple game like Connect Four could require significant resources to train: in our initial implementation, training would have taken weeks on a single gpu-enabled computer.

@danielwillemsen: have you gotten Connect4 training in a reasonable amount of time, and without needing a whole slew of performance optimizations?

danielwillemsen commented 4 years ago

@sbodenstein Sorry for not having replied to your earlier comments before, I have been on holidays and have focused on my research.

I have not done many performance optimizations apart for parallelizing game playing. Training time depends a lot on the amount of performance you want to get from it.

After playing ~15000 games with 100 MCTS simulations per move, the neural net without any search seems comparable to MCTS search with 5000 simulations per move. This takes a few hours for me on a 4 gpu machine. CPU is the bottleneck though due to the python MCTS implementation.

Now this is less computation than they did: 40 generations of ~7000 games with 800 MCTS simulations per move.

This would be approximately 150x more neural network evaluations. In addition they use a larger neural net (20x128 filters vs 5x50 filters in my implementation).

The performance of my AlphaZero is thus probably significantly worse than theirs. I have never evaluated its performance on a solved-moves dataset like they did in that article.

So yes, training can be done in a reasonable amount of time, but the agent will be weaker.

Is this information of any use to you?

lanctot commented 4 years ago

We have recently merged the PR by @sbodenstein here: https://github.com/deepmind/open_spiel/pull/134

So I will close this now, but more contributions are still very welcome! Thanks for your interest in OpenSpiel.

google-deepmind / open_spiel

alphaZero implementation ideas and passing state to rl_agent #29