I noticed in the original paper that the very first time the "exploration value" of the upper confidence bound
is calculated, the visit-count term turns out to be zero because no branch has been visited yet.
If this were correct, it would mean that the search is not guided by the prior probabilities at all initially (that can't be right?).
My question:
What is your rationale for choosing `+ EPS` instead of `+ 1`? The way I see it, this scales the initial upper confidence of every action by a factor of `sqrt(EPS) = 1e-4` compared with `+ 1`. This changes the selection once some actions have been taken and others have not. My intuition would have been to use the prior move probabilities unaltered (i.e. `+ 1`). What are your thoughts?
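To make the comparison concrete, here is a minimal sketch of the exploration term for a node where nothing has been visited yet. It is not the repository's code: the function name `exploration_terms`, the example priors, `c_puct = 1.0`, and `EPS = 1e-8` (inferred from `sqrt(EPS) = 1e-4`) are all assumptions for illustration.

```python
import math

EPS = 1e-8  # assumed value, chosen so that sqrt(EPS) = 1e-4


def exploration_terms(priors, parent_visits, offset, c_puct=1.0):
    """Exploration term c_puct * P(s, a) * sqrt(N(s) + offset) for unvisited actions.

    For an unvisited action N(s, a) = 0, so the denominator 1 + N(s, a) is 1.
    """
    return [c_puct * p * math.sqrt(parent_visits + offset) for p in priors]


priors = [0.6, 0.3, 0.1]  # hypothetical prior move probabilities

# First selection at a fresh node: the parent visit count is 0.
u_eps = exploration_terms(priors, parent_visits=0, offset=EPS)
u_one = exploration_terms(priors, parent_visits=0, offset=1.0)

# Index of the action each variant would select first.
best_eps = max(range(len(priors)), key=u_eps.__getitem__)
best_one = max(range(len(priors)), key=u_one.__getitem__)

print(u_eps, best_eps)
print(u_one, best_one)
```

In this sketch both offsets rank the unvisited actions by their priors, so the first selection is the same. Only the overall scale of the term differs.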
Actually, I don't think it matters. My reasoning above was wrong: when calculating the upper confidence for the second action, the numerator is nonzero for all actions.
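A small sketch of that second selection, under the same assumptions as before (hypothetical priors and Q-values, `c_puct = 1.0`, `EPS = 1e-8`). For simplicity the offset is applied to the parent count for every action, which may differ from how the repository branches on visited and unvisited actions.

```python
import math

EPS = 1e-8  # assumed value, chosen so that sqrt(EPS) = 1e-4


def puct_scores(q, n_sa, priors, offset, c_puct=1.0):
    """Q(s, a) + c_puct * P(s, a) * sqrt(N(s) + offset) / (1 + N(s, a))."""
    n_s = sum(n_sa)  # parent visit count = sum of child visit counts
    return [
        qi + c_puct * p * math.sqrt(n_s + offset) / (1 + ni)
        for qi, ni, p in zip(q, n_sa, priors)
    ]


priors = [0.6, 0.3, 0.1]  # hypothetical prior move probabilities
q = [0.2, 0.0, 0.0]       # hypothetical value estimate after one visit
n_sa = [1, 0, 0]          # action 0 was taken once, the others never

scores_eps = puct_scores(q, n_sa, priors, EPS)
scores_one = puct_scores(q, n_sa, priors, 1.0)

# Exploration part alone (score minus Q) for the EPS variant.
explore_eps = [s - qi for s, qi in zip(scores_eps, q)]
print(explore_eps)
print(scores_eps, scores_one)
```

With one visit at the parent, the square root is about 1 with `+ EPS` and `sqrt(2)` with `+ 1`, so the priors contribute for every action either way. The two variants then differ only by a constant factor on the exploration term relative to Q.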
For reference, your implementation handles the zero term here: https://github.com/suragnair/alpha-zero-general/blob/master/MCTS.py#L115
You discussed this earlier here: https://github.com/suragnair/alpha-zero-general/issues/43