jonathan-laurent / AlphaZero.jl

A generic, simple and fast implementation of DeepMind's AlphaZero algorithm.
https://jonathan-laurent.github.io/AlphaZero.jl/stable/
MIT License
1.23k stars · 136 forks

Supervised learning and samples #140

Open StepHaze opened 2 years ago

StepHaze commented 2 years ago

The idea was suggested by Jonathan: "I guess what you'd have to do is generate many samples of the kind that are stored in AlphaZero's memory buffer. You can take these samples either from human play data or have other players play against each other to generate data. If you do so, be careful to add some exploration so that the same game is not played again and again and you get some diversity in your data. Once you've got the data, you can either use the Trainer utility in learning.jl or just write your training procedure yourself in Flux."
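A minimal sketch of the exploration idea mentioned above, assuming you generate games with scripted (non-MCTS) players: with probability ϵ a uniformly random legal move is played instead of the player's preferred move, so that successive games differ. All names here are hypothetical and not part of AlphaZero.jl.

```julia
using Random

# ϵ-greedy move selection for data generation.
# `legal` is the list of legal moves in the current state and `preferred`
# is the move a scripted player would normally choose.
function pick_move(rng, legal, preferred; ϵ = 0.1)
    # With probability ϵ, explore a random legal move; otherwise exploit.
    return rand(rng) < ϵ ? rand(rng, legal) : preferred
end
```

Tuning ϵ trades off diversity against data quality: higher values yield more distinct games but weaker play.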

Did anyone implement it? I still don't understand in which format the games and moves are stored in the memory buffer.

jonathan-laurent commented 2 years ago

> will it ever be implemented?

Doing so is not my priority right now, but I would be happy to welcome contributions here. Note that I am working on a rewrite of AlphaZero.jl that should be ready by the end of the summer, so you might want to wait a little bit if your intent is to submit a PR.

> I still don't understand in which format the games and moves are stored in the memory buffer.

See src/memory.jl.

StepHaze commented 2 years ago

Thanks! Since you're going to rewrite AlphaZero.jl, will our old code still work?

jonathan-laurent commented 2 years ago

Previous versions will remain accessible through git or the package manager, but the new version will indeed break compatibility with existing code.

StepHaze commented 2 years ago

What's the main reason for rewriting AlphaZero.jl? The existing version already allows us to create pretty strong bots.

jonathan-laurent commented 2 years ago

See https://github.com/jonathan-laurent/AlphaZero.jl/tree/master/redesign.

StepHaze commented 2 years ago

Thanks!

StepHaze commented 2 years ago

Could you please add a supervised learning feature in the next release, so that we can feed in human-played games instead of self-play games? We are willing to pay a reasonable price.

jonathan-laurent commented 2 years ago

I will keep this in mind, although I cannot make any promises right now.

StepHaze commented 2 years ago

Please! We really need it.

Your AlphaZero.jl is a WONDERFUL project. I must say you're a genius. I spent months trying to train my bots using Python projects, and that was very slow and inefficient. With your masterpiece I trained my bot in a couple of days.

jonathan-laurent commented 2 years ago

Can you tell me more about how you or your company are using AlphaZero.jl and for what game/environment? It is always interesting for me to get this kind of feedback.

StepHaze commented 2 years ago

It's a non-commercial, educational project. I teach kids to play a board game (of the mancala family). We don't have good software for it, so sometimes we don't even know where a player made a mistake. With AlphaZero.jl I created a bot that plays pretty strongly, and the "explore" function gives us an idea of which moves are good and which are bad. Thanks for AlphaZero.jl!

jonathan-laurent commented 2 years ago

Thanks for the testimony. It is great to hear that AlphaZero.jl is being used successfully in an educational project.

StepHaze commented 2 years ago

The bot plays pretty strongly, but it still leaves much to be desired. And when I tried to make vectorize_state more complex (82x1x22), I started getting "Out of memory" errors.

So I was thinking about supervised learning. I have thousands of games played by masters. I looked at src/memory.jl and noticed the following: TrainingSample{State}, the type of a training sample, which features the following fields:

How can I define these values? All I have is thousands of games with moves and results. They weren't played using MCTS, so I don't know the values of π, etc. Frankly, I'm very confused.

StepHaze commented 2 years ago

I'm not a professional Julia programmer. I had to learn Julia to create a bot based on AlphaZero.jl.

jonathan-laurent commented 2 years ago

First of all, a word of warning. I understand that you are not a trained programmer, and it is all the nicer for me to learn that you were still able to use this package on your own game.

That being said, an algorithm such as AlphaZero can hardly be used as a black box, and the moment you try to do something a bit unusual, there is no escaping understanding the codebase and the underlying algorithm. In the long run, you may want to take the time to improve your Julia skills, read a bit about machine learning and AlphaZero, and then try to understand the codebase as a whole.

Regarding your current question: if you have a database of games played by humans, you can extract samples from it in the following way. In state s, you would set π to a distribution that puts weight 1 on the action chosen by the human player and 0 elsewhere. Moreover, you would set z using the final outcome of the game that s is a part of.
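The recipe above can be sketched as follows. This is a hypothetical helper, not part of AlphaZero.jl: it assumes two players alternating moves with player 1 moving first, and it produces plain named tuples; the actual sample type and the exact sign convention for z should be checked against src/memory.jl.

```julia
# Sketch: converting one human-played game into (s, π, z) training samples.
# `states` is the sequence of game states, `actions` the index of the move
# the human chose in each state, `num_actions` the size of the action space,
# and `outcome` the final result from player 1's perspective
# (+1 win, 0 draw, -1 loss).
function human_game_to_samples(states, actions, num_actions, outcome)
    samples = []
    for (i, (s, a)) in enumerate(zip(states, actions))
        # One-hot policy: weight 1 on the human move, 0 elsewhere.
        π = zeros(Float64, num_actions)
        π[a] = 1.0
        # z is the game outcome seen from the player to move at state s,
        # assuming strict alternation (check this against src/memory.jl).
        z = isodd(i) ? outcome : -outcome
        push!(samples, (s = s, π = π, z = Float64(z)))
    end
    return samples
end
```

Running this over your whole database and concatenating the results would give you a flat list of samples to feed to a training loop, e.g. one written directly in Flux.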