suragnair / alpha-zero-general

A clean implementation based on AlphaZero for any game in any framework + tutorial + Othello/Gobang/TicTacToe/Connect4 and more
MIT License

Initial exploration values in MCTS #280

Closed by t4n0 1 year ago

t4n0 commented 1 year ago

I noticed in the original paper that the very first time the "exploration value" of the upper confidence bound, $U(s,a) = c_{puct} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$, is calculated, the term $\sqrt{\sum_b N(s,b)}$ turns out to be zero because no branch has been visited yet.

If this were correct, it would mean that the search is initially not guided by the prior probabilities at all (that can't be right?).

Your implementation resolves this here: https://github.com/suragnair/alpha-zero-general/blob/master/MCTS.py#L115

And you discussed this initially here: https://github.com/suragnair/alpha-zero-general/issues/43

My question: What is your rationale for choosing + EPS instead of + 1? The way I see it, this scales the initial upper confidence for all actions by a factor of sqrt(EPS) = 1e-4. This changes the selection once some actions have been taken and others have not. My intuition would have been to use the prior move probabilities unaltered (i.e. + 1). What are your thoughts?

t4n0 commented 1 year ago

Actually, I don't think it matters. My reasoning above was wrong. When calculating the upper confidence for the second action, the numerator is nonzero for all actions.
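
For anyone landing here later, a minimal sketch of the point above. It is not the repository's code: it assumes the usual PUCT exploration term with the offset added under the square root, takes Q as zero everywhere, and assumes EPS = 1e-8 (so that sqrt(EPS) = 1e-4 as quoted in the question). The names `exploration`, `best_action`, `C_PUCT` and the prior values are hypothetical.

```python
import math

EPS = 1e-8    # assumed value, so that sqrt(EPS) is about 1e-4
C_PUCT = 1.0  # hypothetical exploration constant

priors = [0.5, 0.3, 0.2]  # hypothetical prior probabilities for three actions


def exploration(prior, parent_visits, child_visits, offset):
    """Exploration term: C_PUCT * P(s,a) * sqrt(N(s) + offset) / (1 + N(s,a))."""
    return C_PUCT * prior * math.sqrt(parent_visits + offset) / (1 + child_visits)


def best_action(parent_visits, child_visits, offset):
    """Index of the action with the largest exploration term (Q assumed to be 0)."""
    scores = [
        exploration(p, parent_visits, n, offset)
        for p, n in zip(priors, child_visits)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# First selection, no offset: every term is zero, so the priors are ignored.
first_zero = [exploration(p, 0, 0, 0.0) for p in priors]

# First selection with an offset: all terms share the same factor, so the
# action with the highest prior wins whether the offset is EPS or 1.
first_eps = best_action(0, [0, 0, 0], EPS)
first_one = best_action(0, [0, 0, 0], 1.0)

# Second selection: action 0 was visited once, so N(s) = 1 and the square
# root is nonzero for every action; EPS barely changes it.
scale_after_one_visit = math.sqrt(1 + EPS)
second_eps = best_action(1, [1, 0, 0], EPS)
second_one = best_action(1, [1, 0, 0], 1.0)
```

In this sketch the offset only decides whether the first selection can see the priors at all. Because the same square-root factor multiplies every action, the argmax is identical for EPS and for 1, and from the second selection on the factor with EPS is within about 1e-8 of sqrt(N(s)).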