kmcrage / leela_lite

A python implementation of lc0 by dkappe, ideal for testing new algorithms.
GNU General Public License v3.0
1 stars 0 forks source link

Exploiting Variance #1

Open kmcrage opened 5 years ago

kmcrage commented 5 years ago

Implement the paper:

Exploiting Variance Information in Monte-Carlo Tree Search Robert Lieck, Vien Ngo, Marc Toussaint

AlphaZero etc. do not use either of the classic definitions of U, but use the square of the simple regret minimisier. They also use the robust max for the final move choice, picking the most visited move.

This scheme calculates variance and reduces variance on the top move choices. This sends more visits to the second best move compared to A0. This makes max Q a better way of choosing the move. This may allow the algorithm to change its mind more easily, but could allow low sampled moves to be chosen unlike the most-visited-move choice.

The variance algorithm adds another hyper-parameter to search, which adds tuning and complexity.

Since in this algorithm we maximise U^2, using the simple regret minimiser for U allows us to reuse the C values that we previously found, making testing simpler. This could also be interpreted as A0 using U^2 in order to reduce variance of the simple regret minimiser, but using the robust max instead of an explicit variance term.

kmcrage commented 5 years ago

this behaves worse than using the robust max. when there are two candidate moves, we divide our visits between then so we are equally confident about both. But the robust max makes us more confident about the best candidate with more concentrated visits that are more likely to avoid traps.