Exploiting Variance

Implement the paper:

"Exploiting Variance Information in Monte-Carlo Tree Search", Robert Lieck, Vien Ngo, Marc Toussaint

AlphaZero (A0) and similar engines do not use either of the classic definitions of U; they use the square of the simple regret minimiser. For the final move choice they use the robust max, picking the most-visited move.
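To make the comparison concrete, here is a minimal sketch of the exploration terms being discussed. It assumes that "the classic definitions" means UCB1, `c * sqrt(ln N / n)`, and a simple regret form, `c * sqrt(sqrt(N) / n)`; the second is an assumption about what this issue means, not something stated here. The function names are hypothetical and are not taken from leela_lite.

```python
import math


def ucb1_term(parent_visits: int, child_visits: int, c: float = 1.0) -> float:
    """Classic UCB1 exploration term: c * sqrt(ln N / n)."""
    return c * math.sqrt(math.log(parent_visits) / child_visits)


def simple_regret_term(parent_visits: int, child_visits: int, c: float = 1.0) -> float:
    """Assumed simple regret form: c * sqrt(sqrt(N) / n)."""
    return c * math.sqrt(math.sqrt(parent_visits) / child_visits)


def puct_term(parent_visits: int, child_visits: int, prior: float, c: float = 1.0) -> float:
    """AlphaZero-style exploration term: c * P * sqrt(N) / (1 + n)."""
    return c * prior * math.sqrt(parent_visits) / (1 + child_visits)


if __name__ == "__main__":
    N, n = 100, 4
    # Squaring the assumed simple regret term gives c^2 * sqrt(N) / n,
    # which has the same shape as the AlphaZero term apart from the
    # prior P and the 1 + n in the denominator.
    print(simple_regret_term(N, n) ** 2)
    print(puct_term(N, n, prior=1.0))
    print(ucb1_term(N, n))
```

Note that squaring also squares the constant, so a C tuned for one form corresponds to C^2 in the other.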

This scheme tracks the variance of the value estimates and spends visits on reducing it for the top move choices. Compared to A0, this sends more visits to the second-best move, which makes max Q a better way of choosing the final move. This may let the algorithm change its mind more easily, but unlike the most-visited-move choice it could also allow a rarely sampled move to be chosen.
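The trade-off between the two final-move rules can be sketched as follows. This is an illustrative sketch only: it tracks per-move variance with Welford's online algorithm, which is one standard way to do it and is not necessarily what the paper or leela_lite uses. `ChildStats`, `robust_max` and `max_q` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ChildStats:
    """Running visit count, mean value (Q) and variance for one move."""
    visits: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Welford's online update: numerically stable, one pass.
        self.visits += 1
        delta = value - self.mean
        self.mean += delta / self.visits
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until there are two samples.
        return self.m2 / (self.visits - 1) if self.visits > 1 else 0.0


def robust_max(children: Dict[str, ChildStats]) -> str:
    """Final move choice by visit count (the most-visited move)."""
    return max(children, key=lambda move: children[move].visits)


def max_q(children: Dict[str, ChildStats]) -> str:
    """Final move choice by mean value, regardless of visit count."""
    return max(children, key=lambda move: children[move].mean)


if __name__ == "__main__":
    children = {"a": ChildStats(), "b": ChildStats()}
    for v in (0.5, 0.5, 0.5, 0.5):
        children["a"].update(v)
    children["b"].update(0.9)  # a single optimistic sample
    print(robust_max(children), max_q(children))
```

In this toy position the two rules disagree: max Q picks the move with a single high sample, which is the risk described above when visits are not spread more evenly across the top moves.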

The variance algorithm adds another hyper-parameter to search, which adds tuning and complexity.

Since this algorithm maximises U^2, using the simple regret minimiser for U allows us to reuse the C values that we previously found, which makes testing simpler. This could also be interpreted as A0 using U^2 in order to reduce the variance of the simple regret minimiser, but relying on the robust max instead of an explicit variance term.

kmcrage / leela_lite
