Initial exploration values in MCTS

I noticed in the original paper that the very first time the exploration term of the upper confidence bound, $U(s,a) = c_{puct} P(s,a) \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$, is calculated, the factor $\sqrt{\sum_b N(s,b)}$ turns out to be zero because no branch has ever been visited yet.

If this were correct, it would mean that the search is not guided by the prior probabilities at all initially (that can't be right?).
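To make the concern concrete, here is a minimal Python sketch (not code from the repository; the helper name puct_scores and the example priors are hypothetical) that evaluates the selection score $Q(s,a) + U(s,a)$ at a freshly expanded node where every $Q(s,a) = 0$ and every $N(s,a) = 0$:

```python
import math


def puct_scores(priors, q, n, c_puct=1.0):
    """Hypothetical helper: Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total = sum(n)  # sum_b N(s,b); zero at a fresh node
    return [q_a + c_puct * p_a * math.sqrt(total) / (1 + n_a)
            for p_a, q_a, n_a in zip(priors, q, n)]


priors = [0.1, 0.7, 0.2]  # made-up prior move probabilities
scores = puct_scores(priors, q=[0.0] * 3, n=[0] * 3)
print(scores)  # [0.0, 0.0, 0.0]

# max() returns the first maximal index, so the tie is broken by position.
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 0, not the action with the 0.7 prior
```

Every score is zero, so the first choice is decided by the argmax tie-breaking rule rather than by the priors.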

Your implementation resolves this here: https://github.com/suragnair/alpha-zero-general/blob/master/MCTS.py#L115

And you discussed this initially here: https://github.com/suragnair/alpha-zero-general/issues/43

My question: What is your rationale for choosing + EPS instead of + 1? The way I see it, at a fresh node this multiplies the initial upper confidence of every action by sqrt(EPS) = 1e-4 compared with + 1. This changes the selection once some actions have been visited and others have not. My intuition would have been to use the prior move probabilities unaltered (i.e. + 1). What are your thoughts?
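The comparison can be sketched in Python. This is a simplified sketch, not the repository's code: it assumes the selection rule at the linked line gives unvisited actions only the term $c_{puct} P(s,a) \sqrt{N(s) + \text{offset}}$ and visited actions the usual PUCT score, with EPS = 1e-8 so that sqrt(EPS) = 1e-4. The helper names node_scores and argmax, the priors, and the Q value are hypothetical.

```python
import math

EPS = 1e-8  # assumed value; sqrt(EPS) == 1e-4


def node_scores(priors, q, n, offset, c_puct=1.0):
    """Hypothetical, simplified per-action scores at one node.

    Visited actions (n[a] > 0, standing in for "Q(s,a) already exists") use
    the standard PUCT score; unvisited actions get only the exploration term
    c_puct * P(s,a) * sqrt(N(s) + offset).
    """
    ns = sum(n)  # N(s): total visits through this node
    scores = []
    for p, q_a, n_a in zip(priors, q, n):
        if n_a > 0:
            scores.append(q_a + c_puct * p * math.sqrt(ns) / (1 + n_a))
        else:
            scores.append(c_puct * p * math.sqrt(ns + offset))
    return scores


def argmax(xs):
    """Index of the first maximal element."""
    return max(range(len(xs)), key=xs.__getitem__)


priors = [0.05, 0.45, 0.5]

# Fresh node: both offsets rank actions by prior; + EPS scales every score by sqrt(EPS).
fresh_eps = node_scores(priors, [0.0] * 3, [0] * 3, EPS)
fresh_one = node_scores(priors, [0.0] * 3, [0] * 3, 1.0)
print(argmax(fresh_eps), argmax(fresh_one))  # 2 2

# After one visit to action 1 with Q = 0.3, the two offsets pick different actions.
q, n = [0.0, 0.3, 0.0], [0, 1, 0]
print(argmax(node_scores(priors, q, n, EPS)))  # 1
print(argmax(node_scores(priors, q, n, 1.0)))  # 2
```

In this sketch the two offsets agree at a fresh node, since multiplying every score by the same constant does not change the argmax. They diverge only once N(s) > 0: with + 1, unvisited actions are scored with sqrt(N(s) + 1) while visited ones use sqrt(N(s)), whereas with + EPS the offset becomes negligible.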

suragnair/alpha-zero-general, issue #280