jonathan-laurent / AlphaZero.jl

A generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
https://jonathan-laurent.github.io/AlphaZero.jl/stable/
MIT License

API discussion #4

Closed findmyway closed 3 years ago

findmyway commented 4 years ago

Hi @jonathan-laurent ,

This project is really awesome!

Since you mentioned it in the doc Develop-support-for-a-more-general-game-interface, I'd like to write down some thoughts and discuss them with you.

Here I'll mainly focus on the Game Interface and MCTS parts. Along the way, I also list the design differences between AlphaZero.jl, ReinforcementLearningBase.jl and OpenSpiel.

Game Interface

According to the Game Interface, implementing a new game comes with the following assumptions:

  1. Zero-sum
  2. Two players
  3. No chance node
  4. Symmetric (optional)

If I understand it correctly, two main concepts are Game and Board.

In OpenSpiel, those two concepts are almost the same (the Board is named state in OpenSpiel), except that the state is not contained in the Game: the Game is just a static description, and the history lives in the state rather than in the game.

In RLBase, the Game is treated as an AbstractEnvironment and the Board is just the observation of the env from the perspective of a player.

From this point of view, most of the interfaces in this package align with those in RLBase. Here is the detailed correspondence:

  1. AbstractGame -> AbstractEnv
  2. board -> observe
  3. Action -> get_action_space
  4. white_playing -> get_current_player
  5. white_reward -> get_reward
  6. board_symmetric -> missing in RLBase. Need to define a new trait to specify whether the state of a game is symmetric or not
  7. available_actions -> get_legal_actions
  8. actions_mask -> get_legal_actions_mask
  9. play! -> (env::AbstractEnv)(action)
  10. heuristic_value -> missing in RLBase.
  11. vectorize_board -> get_state
  12. symmetries -> missing in RLBase
  13. game_terminated -> get_terminal
  14. num_actions -> length(action_space)
  15. board_dim -> size(rand(observation_space))
  16. random_symmetric_state -> missing in RLBase

I think it won't be very difficult to adapt the package to use OpenSpiel.jl, or even the interfaces in RLBase.
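Just to make the correspondence concrete, most of the glue code should be one-liners along the following lines (a rough, untested sketch based on the mapping above; RLBaseGame is a made-up wrapper type, not part of either package):

```julia
# Rough, untested sketch: a thin wrapper mapping GameInterface functions onto
# RLBase ones, following the correspondence listed above.
# `RLBaseGame` is a hypothetical wrapper type, not part of either package.
import AlphaZero.GI
using ReinforcementLearningBase

struct RLBaseGame{E<:AbstractEnv}
    env::E
end

GI.actions_mask(g::RLBaseGame)      = get_legal_actions_mask(observe(g.env))
GI.available_actions(g::RLBaseGame) = get_legal_actions(observe(g.env))
GI.white_playing(g::RLBaseGame)     = get_current_player(g.env) == 1  # assuming players are numbered 1 and 2
GI.game_terminated(g::RLBaseGame)   = get_terminal(observe(g.env))
GI.play!(g::RLBaseGame, action)     = g.env(action)
GI.vectorize_board(::Type{<:RLBaseGame}, board) = get_state(board)
```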

MCTS

I really like the implementation of async MCTS in this package. I would like to see it separated out as a standalone package.

The naming of some types is slightly strange to me. For example, there's an Oracle{Game} abstract type. If I understand it correctly, it is used in the rollout step to select an action. The first time I saw the name Oracle, I assumed its subtypes must implement some smart algorithms 😆 . But in MCTS it is usually a lightweight method, am I right?
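To check my understanding, here is roughly how I picture an oracle (just a sketch with made-up names like evaluate and random_playout, not this package's actual API):

```julia
# Sketch of my mental model of an MCTS oracle (made-up names, not the actual API):
# given a game state, return a prior over the available actions and a value estimate.
abstract type Oracle end

# A lightweight oracle: uniform priors, value estimated by a random playout.
struct RolloutOracle <: Oracle end

function evaluate(::RolloutOracle, game, available_actions)
    n = length(available_actions)
    prior = fill(1.0 / n, n)        # uniform prior over the legal actions
    value = random_playout(game)    # hypothetical helper: play random moves to the end
    return prior, value
end

# A "neural oracle" would instead query a policy/value network here, which is why
# the same interface can cover both the lightweight and the smart case.
```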

The implementation of Worker assumes that there are only two players in the game. Do you have any idea how to extend it to multi-player games?

At first glance, I thought the async MCTS used some kind of root-level or tree-level parallelization. But I can't find multi-threading used anywhere in the code. It seems that the async part mainly collects a batch of states and gets the evaluation results all at once. Am I right here? It would be great if you could share some of the implementation considerations 😄
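To illustrate what I mean, here is the pattern I think is going on, written as a rough sketch with made-up names (not the actual code):

```julia
# Rough sketch of my understanding of the async design (made-up names, not the
# actual code): many simulation tasks run cooperatively on a single thread and
# block on an evaluation request; one evaluator task pops a full batch and runs
# the network once on all of it.
const EvalRequest = Tuple{Any,Channel{Any}}
requests = Channel{EvalRequest}(Inf)

function simulate!(state)
    reply = Channel{Any}(1)
    put!(requests, (state, reply))   # submit the leaf state for evaluation...
    prior, value = take!(reply)      # ...and yield until the whole batch is processed
    # ... continue the MCTS simulation using (prior, value) ...
end

function evaluator(network, batch_size)
    while true
        batch   = [take!(requests) for _ in 1:batch_size]
        results = evaluate_batch(network, first.(batch))  # hypothetical batched inference call
        for ((_, reply), r) in zip(batch, results)
            put!(reply, r)
        end
    end
end

# Typical usage: `@async evaluator(net, 32)` alongside many `@async simulate!(s)` tasks.
```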

Also cc @jbrea 😄

jonathan-laurent commented 4 years ago

Thanks for your interest in AlphaZero.jl and for your thoughtful post!

Game Interface

I agree with your assessment, modulo a few remarks:

More generally, I completely agree with the goal of moving towards a standard game interface.

MCTS

findmyway commented 4 years ago

Thanks for your explanation. Things are clearer to me now.

Also, people often talk about "neural oracles" so I feel that the name is especially relevant here

I see. Sorry that I might have confused myself with the concept of oracle in active learning.

As you said, the main gain comes from being able to batch evaluation requests, which results in a 20x speedup in my connect four benchmark.

That's an amazing improvement!

Since you mentioned MuZero, do you have any plan to implement it? 😄

jonathan-laurent commented 4 years ago

You're welcome. Thanks again for sharing your thoughts and for your work on RLBase.jl and OpenSpiel.jl.

Since you mentioned MuZero, do you have any plan to implement it? 😄

This is a great question. At some point, I think it should be possible to have a framework where both algorithms can just be instantiated by assembling slightly different sets of primitives. This is another thing that should guide our API discussion.

findmyway commented 4 years ago

I just attempted to adapt OpenSpiel/RLBase to the existing GameInterface. This is an easy way to verify whether we can safely replace the GameInterface with RLBase or not.

The biggest problem now is the board_symmetric related APIs. Essentially, the problem comes from the fact that the concept of Board is exposed in AlphaZero.jl and heavily used across different modules. In other packages, only the equivalent of vectorize_board is exposed.

I have a feeling that it won't be that easy to drop it. But we do have to rethink which parts are essential to the games we want to handle and which ones are just enhancements.


The following is some ad-hoc code implementing a new Game with OpenSpiel.jl.

using OpenSpiel
using ReinforcementLearningBase
using ReinforcementLearningEnvironments

import AlphaZero.GI

# A global environment instance, used below to recover type information.
tic_tac_toe = OpenSpielEnv("tic_tac_toe")

GI.Board(::Type{<:OpenSpielEnv}) = typeof(tic_tac_toe.state)
GI.Action(::Type{<:OpenSpielEnv}) = Int

# !!! relies on the global `tic_tac_toe` instance
GI.actions(::Type{<:OpenSpielEnv}) = get_action_space(tic_tac_toe).span

Base.copy(g::OpenSpielEnv{O,D,S,G,R}) where {O,D,S,G,R} = OpenSpielEnv{O,D,S,G,R}(copy(g.state), g.game, g.rng)

GI.actions_mask(g::OpenSpielEnv) = get_legal_actions_mask(observe(g))

GI.board(g::OpenSpielEnv) = g.state
# OpenSpiel numbers players from 0; player 0 is treated as white.
GI.white_playing(g::OpenSpielEnv) = get_current_player(g) == 0

# White's reward, or `nothing` if the game is not over yet.
function GI.white_reward(g::OpenSpielEnv)
    obs = observe(g, 0)
    get_terminal(obs) ? get_reward(obs) : nothing
end

GI.play!(g::OpenSpielEnv, pos) = g(pos)
GI.vectorize_board(::Type{<:OpenSpielEnv}, board) = get_state(board)
GI.available_actions(g::OpenSpielEnv) = get_legal_actions(observe(g))

# Rebuild an environment from a board, reusing the global game and rng.
OpenSpielEnv{O,D,S,G,R}(state) where {O,D,S,G,R} = OpenSpielEnv{O,D,S,G,R}(state, tic_tac_toe.game, tic_tac_toe.rng)

# Two states are considered equal iff their observation tensors agree.
Base.:(==)(a::typeof(tic_tac_toe.state), b::typeof(tic_tac_toe.state)) = observation_tensor(a) == observation_tensor(b)

GI.canonical_board(g::OpenSpielEnv) = g.state
jonathan-laurent commented 4 years ago

Thanks @findmyway for having a deeper look into integrating AlphaZero.jl with OpenSpiel/RLBase.

I agree with you that the biggest obstacle right now is the symmetry assumption, which is exploited in several different places. As I explained in a previous post, I used to believe that it was the right thing to do (at least for the standard board games where AlphaZero was first applied, which happen to be symmetric), but I am now getting convinced that the resulting optimizations are not worth the cost in terms of generality.

I agree that dropping it will require a fair amount of refactoring across the codebase but the codebase is pretty small to start with so nothing too scary here! I am happy to take a stab at it myself.

findmyway commented 4 years ago

That's great. Feel free to ping me if there's any progress.

(By the way, could you tag a release before refactoring? So that we can easily compare and discuss the diffs)

jonathan-laurent commented 4 years ago

I will tag a release. I just wanted to wait a little bit in case someone uncovers a problem right after the public announcement.

oscardssmith commented 4 years ago

I personally find the MCTS implementation here a little hard to follow. I think the most readable I've seen is https://github.com/dkappe/a0lite. The master branch has a very simple version, but the fancy branch also has batching and pruning.

jonathan-laurent commented 4 years ago

@oscardssmith Thanks for the feedback. Could you be more specific and tell me what made it hard to follow for you?

One thing that probably makes it confusing is that it contains two implementations in one: there are both a synchronous and an asynchronous version of MCTS, which share most of their code.

I should also probably add comments to explain the overall structure. :-) Please tell me if you can see anything else.

oscardssmith commented 4 years ago

A lot of it is naming and ordering, I think. I'm pretty sure I can guess what ActionStats is, but I don't know just by looking whether W is the sum of accumulated rewards, an average, or something else. Similarly, in BoardInfo I am somewhat unsure what Vest is (maybe estimated value?). Also, having tree :: Dict{Board, BoardInfo} will produce wildly incorrect results unless Board tracks all the history, since merging nodes that were reached through different paths can prevent MCTS from converging.

jonathan-laurent commented 4 years ago

Also, having tree :: Dict{Board, BoardInfo} will produce wildly incorrect results unless Board tracks all the history, since merging nodes that were reached through different paths can prevent MCTS from converging.

@oscardssmith You are raising a very interesting point, which prompted me to do some research. From what I've seen, this is actually a bit of a controversial issue.

Merging nodes that correspond to identical states has actually been presented as an MCTS optimization. See for example section 5.2.4 in this 2012 MCTS survey by Browne et al. Intuitively at least, this makes sense: you do not want to waste computing time estimating the value of the same state several times (at least when the system is Markovian).

Some people seem to disagree on whether or not this may bias exploration/exploitation. I personally found the argument for merging more convincing.

Here, people are warning that transposition tables should be used with care when imperfect information is involved, which is unsurprising. AlphaZero.jl is targeting games of perfect information though.

I would be interested to have your opinion on this debate.

oscardssmith commented 4 years ago

Suppose you have the following part of a move graph: A->B->D, A->C->D, C->E, where A is your turn. Let D be good for you, and E be bad for you. If you merge the two D nodes, then C will always get more visits than B since the only difference is it has one extra child. Thus your MCTS converges to pick an arbitrarily bad move.

jonathan-laurent commented 4 years ago

Suppose you have the following part of a move graph: A->B->D, A->C->D, C->E, where A is your turn. Let D be good for you, and E be bad for you. If you merge the two D nodes, then C will always get more visits than B since the only difference is it has one extra child. Thus your MCTS converges to pick an arbitrarily bad move.

I disagree that C will get more visits than B. With the current implementation, I would expect MCTS to ultimately only visit B. I guess we could test this claim easily, but I think our misunderstanding might be resolved by noting that in the current implementation, there are separate counters for "number of visits of D coming from B" and "number of visits of D coming from C". These numbers, which we can write N_B(D) and N_C(D), are stored in nodes B and C respectively. Therefore, observing a success after traversing the path A -> B -> D does not increase N_A(C), and thus does not encourage further exploration of C.
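In pseudocode, the tree data structure looks roughly like this (a simplified sketch, not the exact definitions from the source):

```julia
# Simplified sketch (not the exact source definitions): statistics are attached
# to edges and stored in the parent node, so "visits of D via B" and
# "visits of D via C" are distinct counters.
struct ActionStats
    P :: Float64   # prior probability of taking this action
    W :: Float64   # total reward accumulated through this edge
    N :: Int       # number of times this edge was traversed
end

struct BoardInfo
    actions :: Vector{ActionStats}  # one entry per available action (edge)
    Vest    :: Float64              # value estimate returned by the oracle
end

# tree :: Dict{Board, BoardInfo}
# Backpropagating along A -> B -> D only updates the ActionStats stored in B
# for the action leading to D; the corresponding statistics in C are untouched.
```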

oscardssmith commented 4 years ago

You're right that I hadn't noticed that. That, however, raises a couple of questions: If the value of D drops, does C become aware of that drop ever? If you only propagate your n's back along the one path you visited, do you also only propagate the win statistics? Also, does this mean that D has a different N when visited via B and C?

jonathan-laurent commented 4 years ago

If the value of D drops, does C become aware of that drop ever?

You are raising an excellent point. The thing is: there is no such thing as an estimate of the value of D. There are only estimates of Q_B(D) and Q_C(D). (I am abusing notation slightly here, I hope you don't mind.) Observing a success after traversing A -> B -> D would update Q_B(D) but not Q_C(D).

If you only propagate your n's back along the 1 path you visited, do you also only propagate the win statistics?

Yes, but this is not a problem as you're only updating a Q function estimate.

Also, does this mean that D has a different N when visited via B and C?

Yes. N_B(D) and N_C(D) are completely distinct values.

jonathan-laurent commented 4 years ago

Overall, I think that what we are getting at is this: there are two fundamentally different ways to implement MCTS. One in which you try to approximate the value function, and one in which you try to approximate the Q function. This raises two questions:

I believed that DeepMind's AlphaGo Zero was using an implementation similar to mine. Indeed, reading their paper, I thought it was pretty clear that they are storing Q-values and not values. But you're making me doubt this now.

oscardssmith commented 4 years ago

I just re-read the relevant parts of the three A0 papers (AlphaGo, AlphaGo Zero, AlphaZero), and I think I can better explain how Lc0 has done it. Basically, if you note that Q(s,a) = V(s') and N(s,a) = N(s'), you now have a way to store your Q and N in the nodes rather than the edges. The edges then store moves and policies. The nice things about this design are that you don't need to store board states anywhere (since you can just make and unmake moves), and it ends up being memory efficient since you are never initializing things you don't need.
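Concretely, the difference in layout is something like this (rough sketch, illustrative only):

```julia
# Rough illustrative sketch of the two layouts.

# Edge statistics (what AlphaZero.jl currently does): W and N live on the
# edge (s, a), stored in the parent node and keyed by board state.
struct EdgeStats
    P :: Float64   # prior for the action
    W :: Float64   # accumulated reward; Q(s, a) = W / N
    N :: Int
end

# Node statistics (Lc0 style): since Q(s, a) = V(s') and N(s, a) = N(s'),
# the same numbers can be stored in the child node. Edges only carry the
# move and its prior, and board states never need to be stored because
# moves can be made and unmade while traversing the tree.
mutable struct Node
    W        :: Float64
    N        :: Int
    children :: Vector{Tuple{Int,Float64,Node}}  # (move index, prior, child node)
end
```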

jonathan-laurent commented 4 years ago

I think you are right. The implementation you're referring to is more memory efficient. On the other hand, my implementation makes it easier to merge nodes. The Lc0 approach is probably a better default though, and I may implement it at some point. In any case, I should add comments in the code for clarity.

Thanks for this very interesting conversation. :-)

deveshjawla commented 3 years ago

The implementation of Worker assumes that there are only two players in the game. Do you have any idea how to extend it to multi-player games?

@findmyway Can you explain what you mean by Worker?

jonathan-laurent commented 3 years ago

The implementation of Worker assumes that there are only two players in the game. Do you have any idea how to extend it to multi-player games?

@findmyway Can you explain what you mean by Worker?

This refers to an old MCTS implementation that has since been replaced.

Regarding multi-player games, there are two possible situations:

Also, I implemented full support for MDPs and compatibility with CommonRLInterface.jl, which can be found on the common-rl-intf branch. I am going to merge with master and release v0.4 as soon as Julia 1.5.3 gets released with this fix.

deveshjawla commented 3 years ago

@jonathan-laurent This segmentation fault only occurs when using the GPU, right? I upgraded to Julia 1.5.3 and running on the CPU is smooth, but with the GPU it throws a segfault.

And there is another error:

```
/home/sdeveshj/.julia/packages/GR/RlE5Y/src/../deps/gr/bin/gksqt: error while loading shared libraries: libQt5Widgets.so.5: cannot open shared object file: No such file or directory
connect: Connection refused
GKS: can't connect to GKS socket application
GKS: Open failed in routine OPEN_WS
GKS: GKS not in proper state. GKS must be either in the state WSOP or WSAC in routine ACTIVATE_WS
```

Do you know what this is?

jonathan-laurent commented 3 years ago

I think the segfault on 1.5.2 was unrelated to GPU handling and it was fixed by this PR: https://github.com/JuliaLang/julia/pull/37594. Can you share more details on the segfault you are getting when using a GPU?

Regarding your other error (GKS: can't connect to GKS socket application), it is a known problem with Plots.jl (see https://github.com/JuliaPlots/Plots.jl/issues/1649) but it does not crash the program and does not seem to keep the plots from being generated either.

deveshjawla commented 3 years ago

I wonder what you changed in the common-rl-intf branch such that this GKS error shows up. True, it does not stop the program and the plots are being generated correctly. However, was your intention to display the plots during training?

The segfault appears to happen inconsistently (sometimes after 1 iteration, sometimes after the 2nd) on CUDA 11.0. But on CUDA 11.1, even after 20 iterations there was no segfault.

```
Starting self-play

  Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:23

Generating 88 samples per second on average
Average exploration depth: 10.9
MCTS memory footprint: 44.13kB
Experience buffer size: 21,000 (21 distinct boards)

signal (11): Segmentation fault
in expression starting at /home/sdeveshj/alphazero_inbestme/scripts/alphazero.jl:50
unknown function (ip: 0x7f9437bc35c5)
free_gemm_select at /usr/local/cuda-11.0/lib64/libcublasLt.so.11 (unknown line)
cublasDestroy_v2 at /home/sdeveshj/.julia/artifacts/7670a86e87b992ec4edb1c9eaeca8e57e15eeb4c/lib/libcublas.so.11 (unknown line)
macro expansion at /home/sdeveshj/.julia/packages/CUDA/0p5fn/lib/cublas/libcublas.jl:13 [inlined]
macro expansion at /home/sdeveshj/.julia/packages/CUDA/0p5fn/src/pool.jl:430 [inlined]
macro expansion at /home/sdeveshj/.julia/packages/CUDA/0p5fn/lib/cublas/error.jl:56 [inlined]
cublasDestroy_v2 at /home/sdeveshj/.julia/packages/CUDA/0p5fn/lib/utils/call.jl:93
658 at /home/sdeveshj/.julia/packages/CUDA/0p5fn/lib/cublas/CUBLAS.jl:78 [inlined]
context! at /home/sdeveshj/.julia/packages/CUDA/0p5fn/src/state.jl:188
657 at /home/sdeveshj/.julia/packages/CUDA/0p5fn/lib/cublas/CUBLAS.jl:77
unknown function (ip: 0x7f9460038c4f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
run_finalizer at /buildworker/worker/package_linux64/build/src/gc.c:277
jl_gc_run_finalizers_in_list at /buildworker/worker/package_linux64/build/src/gc.c:363
run_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:391 [inlined]
jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:3127
gc at ./gcutils.jl:79 [inlined]
gc at /home/sdeveshj/alphazero_inbestme/src/networks/flux.jl:138
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
learning_status at /home/sdeveshj/alphazero_inbestme/src/learning.jl:165
learning_step! at /home/sdeveshj/alphazero_inbestme/src/training.jl:208
macro expansion at ./timing.jl:310 [inlined]
macro expansion at /home/sdeveshj/alphazero_inbestme/src/report.jl:267 [inlined]
train! at /home/sdeveshj/alphazero_inbestme/src/training.jl:326
resume! at /home/sdeveshj/alphazero_inbestme/src/ui/session.jl:315
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2231 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:117
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:206
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:157 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:566
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:660
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:840
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:913
jl_load_rewrite at /buildworker/worker/package_linux64/build/src/toplevel.c:914
include at ./Base.jl:380
include at ./Base.jl:368
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
exec_options at ./client.jl:296
_start at ./client.jl:506
jfptrstart_53898.clone_1 at /home/sdeveshj/julia-1.5.3/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 839087620 (Pool: 838766609; Big: 321011); GC: 416
Segmentation fault (core dumped)
```

jonathan-laurent commented 3 years ago

I wonder what you changed in the common-rl-intf branch such that this GKS error shows up.

Actually, I've had this error since the first release on my machine so I am not sure it has to do with common-rl-intf.

The segfault appears to happen inconsistently (sometimes after 1 iteration, sometimes after the 2nd) on CUDA 11.0. But on CUDA 11.1, even after 20 iterations there was no segfault.

Interesting. I also had some issues with CUDA11.0. I am wondering what changed with 11.1.

deveshjawla commented 3 years ago

In the common-rl-intf branch you removed:

# ENV["CUARRAYS_MEMORY_LIMIT"] = 7_500_000_000
ENV["JULIA_CUDA_MEMORY_POOL"] = "split" # "binned" / "split"

# Enables running the script on a distant machine without an X server
ENV["GKSwstype"]="nul"

from scripts/alphazero.jl

I believe that's why the GKS warning was being printed.

Adding these two lines back allowed me to train the connect-four as well.

deveshjawla commented 3 years ago

The average reward is constant across iterations for the Alphazero benchmark. Do you know why this could be or which part of the code to check?


jonathan-laurent commented 3 years ago

@deveshjawla What example are you talking about? If you are referring to my grid world example, it is not tuned yet and the hyperparameters are completely off. Still, it should make progress during the very first iterations.

deveshjawla commented 3 years ago

Hi, it's the trading game. The grid world example trains fine: the average reward varies across iterations. I'm using RL.reset! to feed random initial states (from a set of ~200), just like in the grid world example. Each initial state has about 2 million ways the game can proceed. What would be helpful to know is how to ensure that at every iteration we start with a different set of initial states.
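For reference, here is roughly what I am doing (a simplified sketch; TradingEnv and load_initial_states are placeholders for my own code):

```julia
# Simplified sketch of my setup (TradingEnv and load_initial_states are
# placeholders for my own code): a pool of ~200 precomputed initial states,
# one of which is sampled uniformly at every reset.
using CommonRLInterface
const RL = CommonRLInterface

const INITIAL_STATES = load_initial_states()  # hypothetical helper returning the ~200 states

function RL.reset!(env::TradingEnv)
    env.state = rand(INITIAL_STATES)
    return nothing
end
```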

deveshjawla commented 3 years ago

Ok, the average reward in the Alphazero benchmark is increasing with each iteration now. Solved.