Closed — roy7 closed this issue 5 years ago
I won't open this as another issue, but normally when using TS for tree search, don't you choose the move with the highest win rate after search is over, not the one with the most visits? I know there isn't a lot of literature on TS + tree search, but looking at *Thompson Sampling Based Monte-Carlo Planning in POMDPs*, page 34:
> In more detail, the ThompsonSampling function has a boolean parameter sampling. If sampling is true, Thompson sampling method is used to stochastically select an action, otherwise a greedy action is returned with respect to current mean (formulas).
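The sampling/greedy switch described in that quote can be sketched roughly like this, assuming per-move success/failure counts and a Beta posterior. Names like `MoveStats`, `beta_sample`, and `thompson_select` are illustrative, not from the paper or the branch:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct MoveStats {
    double successes = 0.0;
    double failures = 0.0;
    double mean() const {
        double n = successes + failures;
        return n > 0.0 ? successes / n : 0.5;
    }
};

// Draw Beta(a, b) as X/(X+Y) with X~Gamma(a,1), Y~Gamma(b,1).
inline double beta_sample(double a, double b, std::mt19937& rng) {
    std::gamma_distribution<double> ga(a, 1.0), gb(b, 1.0);
    double x = ga(rng), y = gb(rng);
    return x / (x + y);
}

// sampling == true:  stochastic Thompson selection (used during search).
// sampling == false: greedy selection on the current mean (final move).
std::size_t thompson_select(const std::vector<MoveStats>& moves,
                            bool sampling, std::mt19937& rng) {
    std::size_t best = 0;
    double best_val = -1.0;
    for (std::size_t i = 0; i < moves.size(); ++i) {
        // +1 on each count gives a uniform Beta(1,1) prior.
        double val = sampling
            ? beta_sample(moves[i].successes + 1.0,
                          moves[i].failures + 1.0, rng)
            : moves[i].mean();
        if (val > best_val) { best_val = val; best = i; }
    }
    return best;
}
```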
It seems that, using the branch as-is, most search goes to higher-prior moves and low-prior moves get 0-1 visits, so a 1-visit move could end up getting chosen wrongly. With my current TS test copy of your branch, but using no prior info (so I can start my tests with a very simple raw TS implementation), lots of moves get some search, so using the winrate seems safe.
In your version, to protect against low-visit outliers sampled near the end of search, choosing the move with the highest LCB could be a safe way to ensure the win rate is 'real', versus just sorting on winrate alone.
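A minimal sketch of LCB-based final move selection, using the worst-case Bernoulli bound `0.5/sqrt(n)` for the standard error so that a single lucky visit can't produce an LCB of 1.0. A real implementation would use the empirical variance instead; all names here are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ChildStats {
    int visits;
    double winrate;  // mean result in [0, 1]
};

// Lower confidence bound on the winrate; z is a tunable confidence width.
double lcb(const ChildStats& c, double z = 2.0) {
    if (c.visits == 0) return -1.0;  // never pick an unvisited child
    // Worst-case Bernoulli standard error: sqrt(0.25 / n).
    double se = 0.5 / std::sqrt(static_cast<double>(c.visits));
    return c.winrate - z * se;
}

std::size_t best_move_by_lcb(const std::vector<ChildStats>& children) {
    std::size_t best = 0;
    double best_val = lcb(children[0]);
    for (std::size_t i = 1; i < children.size(); ++i) {
        double v = lcb(children[i]);
        if (v > best_val) { best_val = v; best = i; }
    }
    return best;
}
```

With this bound, a move with 1 visit and a 100% winrate scores 0.0, while a move with 400 visits and a 55% winrate scores 0.5, so the well-explored move wins the final selection.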
Using the highest winrate would be rather unstable, I'd think. If you'd sample a move once, it wins, and you time out, you pick a move with a single visit that may well be a blunder.
> we add in the parent info on every child
policy_eval is the policy prior, not the parent info
> Using the highest winrate would be rather unstable, I'd think. If you'd sample a move once, it wins, and you time out, you pick a move with a single visit that may well be a blunder.
Yeah, I saw this happen with the official thompson_sampling branch as-is, since it's using prior info to adjust search allocation. In vanilla TS, though, all nodes get a fair bit of search, so that seems unlikely to be an issue. Of course, vanilla TS is way too wide for Go, and the use of priors is very important.
> policy_eval is the policy prior, not the parent info
Whoops, I forgot policy info is in there. But actually:
```cpp
auto policy_eval = parent_eval - policy_reduction;
```
It is `parent_eval` reduced by an amount based on the policy prior. So should `parent_eval` be replaced by the child's winrate in cases where `child.get_visits() > 0`? Currently `parent_eval` still gets factored into every child, unlike normal /next search.
It gets factored in (it's a prior), but its relevance should reduce with more visits.
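One way that fading relevance could work is a pseudo-count blend between the FPU prior and the child's own observed mean. This is a hypothetical sketch of the behaviour described above, not the code in either branch; `blended_eval` and `prior_weight` are made-up names:

```cpp
#include <cassert>
#include <cmath>

// fpu: parent_eval - policy_reduction, used as the prior estimate.
// child_mean: the child's observed winrate; visits: real visits so far.
// prior_weight: how many visits the prior is "worth".
double blended_eval(double fpu, double child_mean, int visits,
                    double prior_weight = 1.0) {
    return (prior_weight * fpu + visits * child_mean)
         / (prior_weight + visits);
}
```

With zero visits the prior dominates entirely; as visits grow, the child's own statistics take over and the parent term becomes negligible.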
I tried factoring in the parent q when selecting nodes. See here under "don't trust initial visits":
https://github.com/Videodr0me/leela-chess-experimental
In chess it did not seem to work; the signal from the first visits (as the effect of parent q diminishes over time) seems reliable enough.
Closing older issues: no discussion for >6 months, and I'm sure there are more recent discussions.
https://github.com/gcp/leela-zero/blob/94a345f7cfcc47fff8a066e68d95376ccef9cde4/src/UCTNode.cpp#L288
In normal /next we overwrite the fpu value if the node has visits:
But in the thompson_sampling branch we add in the parent info on every child even after it has visits:
Is this intentional? Or should the `std::tie` happen after `success` and `failure` are set, so they can be overwritten? (Or move them into an else branch, for the case where `get_visits()` is 0.)
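As a hedged sketch of the else-branch option suggested above: derive the success/failure pseudo-counts from the FPU value only when the child is unvisited, and use the child's real counts otherwise, mirroring how /next overwrites the fpu value once a node has visits. `thompson_counts`, `fpu_weight`, and the surrounding structure are hypothetical, not the branch's actual code:

```cpp
#include <cassert>

struct Counts { double success; double failure; };

Counts thompson_counts(int visits, double child_success, double child_failure,
                       double fpu_eval, double fpu_weight = 1.0) {
    if (visits == 0) {
        // Unvisited: seed the Beta counts from the parent-derived prior.
        return {fpu_weight * fpu_eval, fpu_weight * (1.0 - fpu_eval)};
    }
    // Visited: trust the child's own statistics; no parent term added.
    return {child_success, child_failure};
}
```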