pytorch / ELF

ELF: a platform for game research with AlphaGoZero/AlphaZero reimplementation

[Suggestion] clarify FPU in paper #140

Closed: alreadydone closed this issue 5 years ago

alreadydone commented 5 years ago

First-play urgency (FPU) is the Q assigned to unvisited nodes during tree search. There are several choices for this parameter:

  • the net value (value-head output) of the parent node,
  • the parent node's Q (average over whole subtree),
  • zero, i.e. the value of a draw,
  • a fixed loss value.
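
For concreteness, here is a minimal sketch of these options (hypothetical names, not ELF's actual API), with Q in [-1, 1] from the side to move's perspective:

```cpp
// Hedged sketch (hypothetical names, not ELF's API): the four FPU choices
// above, expressed as the initial Q given to a child with zero visits.
enum class FpuMode { ParentNetValue, ParentMeanQ, Draw, Loss };

float first_play_urgency(FpuMode mode, float parent_net_value,
                         float parent_mean_q) {
  switch (mode) {
    case FpuMode::ParentNetValue: return parent_net_value;  // raw net estimate
    case FpuMode::ParentMeanQ:    return parent_mean_q;     // subtree average
    case FpuMode::Draw:           return 0.0f;              // init to draw
    case FpuMode::Loss:           return -1.0f;             // init to loss
  }
  return 0.0f;  // unreachable; silences compiler warnings
}
```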

Since the ELF data says "root_unexplored_q_zero":false,"unexplored_q_zero":false, I assume it is not the third approach. Could you please clarify, hopefully also in a future version of the paper, which FPU setting you used? It is an important parameter: apparently Minigo [1] (late v14 and all subsequent runs) and Leela Chess Zero (late Test30 and all subsequent runs) found a large benefit from the FPU = loss setting. FPU could affect ladder behavior as well; FPU = loss probably helps with deep tactical lines. Thanks for your attention.

[1] https://github.com/tensorflow/minigo/blob/master/RESULTS.md#v14-bigtable-integration-with-init-to-loss-switch

Around model 280, we discovered on LCZero forums that AlphaGoZero had used init-to-loss for Q values. We switched to init-to-loss from init-to-parent and saw our network gain dramatically in strength. Soon, our top model thoroughly exceeded our previous best model.

yuandong-tian commented 5 years ago

That's a good question.

It uses none of the above. OpenGo uses the current running mean Q of the parent node at the time the child node is created. If there is no current running mean Q yet, it uses the net val instead (or the draw value if root_unexplored_q_zero (at the root layer) or unexplored_q_zero is true). See:

https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/tree_search_node.h#L298
https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/tree_search_node.h#L227
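
In other words, the behavior is roughly the following (a hedged sketch with illustrative names, not the actual code in tree_search_node.h):

```cpp
// Hedged sketch of the rule described above (illustrative names, not the
// actual tree_search_node.h code). Called when a child node is created.
struct Node {
  int   num_visits = 0;    // completed rollouts through this node
  float mean_q     = 0.f;  // running mean Q over those rollouts
  float net_value  = 0.f;  // value-head output from when this node was expanded
};

float initial_child_q(const Node& parent, bool is_root,
                      bool root_unexplored_q_zero, bool unexplored_q_zero) {
  if (parent.num_visits > 0) {
    return parent.mean_q;  // current running mean Q of the parent
  }
  // No running mean yet: fall back to net val, or to the draw value if the
  // corresponding flag is set.
  if (is_root ? root_unexplored_q_zero : unexplored_q_zero) {
    return 0.0f;  // draw value
  }
  return parent.net_value;
}
```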

I don't know whether it is a good choice, but it looks like it works.

alreadydone commented 5 years ago

Thanks for the clarification! The current running mean Q of the parent is basically what I meant by "the parent node's Q (average over whole subtree)" in my original post. But then the question seems to be: when is the child node created? I can only think of two possibilities:

  • either it is created when the parent is expanded, i.e., right after the parent's policy and value outputs are ready, so that the parent's Q is just the net val,
  • or it is created when it is first selected by the UCT algorithm.

However, in the second case, in order for the child to be selected, its Q must already have been used to compute the action value Q+U, and the crucial question is which Q was used to make that selection. I guess you mean that the parent's running mean Q is used to make the selection; if so, this is exactly what I meant by the parent node's Q in my original post. In that case, the parent's running mean Q would also have been used in previous rollouts that reached the parent but did not select this child, and this running mean could be different each time. (However, since you only need to compute Q+U for the unvisited child with the highest policy prior, each child's Q only needs to be evaluated once before it is expanded.) Let me know if there is something I don't get. (P.S. I spent some time reading the code before, but I am not sure I completely understand the behavior.)
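
To spell out my reading of the second case, here is a hedged PUCT-selection sketch (illustrative names, not ELF's code) in which an unvisited child borrows the parent's current running mean Q:

```cpp
#include <cmath>
#include <vector>

// Hedged sketch of PUCT selection (illustrative, not ELF's code): an
// unvisited child scores with the parent's current running mean Q, so its
// effective Q can differ between rollouts as the parent's mean moves.
struct Child {
  float prior      = 0.f;  // policy prior P(s, a)
  int   num_visits = 0;
  float mean_q     = 0.f;  // running mean over this child's rollouts
};

int select_child(const std::vector<Child>& children, float parent_mean_q,
                 int parent_visits, float c_puct) {
  const float sqrt_n = std::sqrt(static_cast<float>(parent_visits));
  int best = -1;
  float best_score = -1e30f;
  for (int i = 0; i < static_cast<int>(children.size()); ++i) {
    const Child& c = children[i];
    // FPU: unvisited children borrow the parent's running mean Q.
    const float q = (c.num_visits > 0) ? c.mean_q : parent_mean_q;
    const float u = c_puct * c.prior * sqrt_n / (1.0f + c.num_visits);
    // All unvisited children share the same q here, so among them only the
    // one with the highest prior can maximize q + u.
    if (q + u > best_score) {
      best_score = q + u;
      best = i;
    }
  }
  return best;
}
```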

yuandong-tian commented 5 years ago

It is the second case (i.e., when the child node is selected and created by the UCT algorithm).

Exactly; the running mean can be different for each child at the moment it is created. The reason is that we want to use the most accurate estimate available so far for each newly created child.

alreadydone commented 5 years ago

Totally clear; thanks again!