leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

Questions about PUCT Algorithm in AGZ #2409

Open likefanfan520 opened 5 years ago

likefanfan520 commented 5 years ago

Hi All,

I have a question about PUCT algorithm:

In the PUCT algorithm of AGZ, the numerator on the right-hand side is sqrt(sum of N), where N is the visit count of each child. The first time a search thread wants to select a child (here all children are leaf nodes), the sum of N is zero because no child has been searched yet, and the Q of every child is still its initial value of zero (none of the children have been evaluated yet). This gives Q + U = 0 for every child, so all the Q + U values are identical and the search thread cannot make a decision. I think I must be misunderstanding something, but I cannot find the problem.
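
To make the question concrete, here is a minimal sketch of the selection rule as I read it (my own illustration in Python, not engine code), assuming U = c_puct * P * sqrt(sum of N) / (1 + N) and Q initialized to 0:

import math

def select_child(children, c_puct=1.5):
    # children: list of dicts with prior P, visit count N, and total value W
    sum_n = sum(c["N"] for c in children)
    def q_plus_u(c):
        q = c["W"] / c["N"] if c["N"] > 0 else 0.0     # Q initialized to 0
        u = c_puct * c["P"] * math.sqrt(sum_n) / (1 + c["N"])
        return q + u
    return max(children, key=q_plus_u)

# Before any child has been visited, sum_n == 0, so every child's Q + U == 0
# and the argmax cannot distinguish them -- exactly the situation above.
children = [{"P": 0.6, "N": 0, "W": 0.0}, {"P": 0.4, "N": 0, "W": 0.0}]
print(select_child(children))  # a tie; max() simply returns the first child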

Thanks!

y-ich commented 5 years ago

@likefanfan520 san,

I also had the same question and investigated it by reading pseudocode.py in the supplementary materials of the AZ paper (not the AGZ paper). You can get the materials from https://science.sciencemag.org/content/suppl/2018/12/05/362.6419.1140.DC1?_ga=2.250263942.1495924003.1559189273-1133240774.1559189273.

The following is what I understood:

  1. In AZ, the argument of sqrt is not the sum of N; it is the parent visit count. The parent visit count is equal to (sum of N) + 1, because the count is also incremented when the node itself is evaluated. So the first time a search thread wants to select a child, the parent visit count is one, not zero. Thus the policy priors in the second term are effective even on the first selection.
  2. Q of the children is indeed initialized to 0, but the important point is that 0 means the loss value. In MCTS, Q is a value in the range [0, 1], not [-1, 1], even though the corresponding neural-network value is in the range [-1, 1]. (A sketch of both points is below.)
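
A simplified sketch of the selection rule as I understood it from pseudocode.py (not a verbatim copy; the names and the c_puct constant are simplified):

import math

def value(child):
    # child = (prior, visit_count, value_sum); value_sum is accumulated on a
    # [0, 1] scale, so an unvisited child defaults to 0.0, i.e. a loss
    prior, visits, value_sum = child
    return value_sum / visits if visits > 0 else 0.0

def ucb(parent_visits, child, c_puct=1.25):
    prior, visits, _ = child
    # the sqrt argument is the parent's visit count, which is already 1
    # right after the parent node itself has been evaluated
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return value(child) + u

children = [(0.6, 0, 0.0), (0.3, 0, 0.0), (0.1, 0, 0.0)]
print(max(children, key=lambda c: ucb(1, c)))  # -> the child with prior 0.6
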
likefanfan520 commented 5 years ago

In MCTS, Q is a value in the range [0, 1], not [-1, 1], even though the corresponding neural-network value is in the range [-1, 1].

  1. I have also read the AZ materials and have the same view as you, but I think there are still problems in some cases. For example, if the parent node is the root node (but not for the first move of the game), then the parent visit count is not equal to (sum of N) + 1, because N has already been increased by the search threads launched from the parent (root node). In this situation, the parent visit count < (sum of N) + 1.

  2. I don't understand your meaning here. Q is W / N, and I think Q should be a value in the range [-1, 1] because W is in [-1, 1]. (In the backup phase, W is increased by the backup value from the leaf node evaluated by the neural network; if the neural network thinks the situation is bad, it gives a negative value < 0.) So why did you say "Q is a value in the range of [0, 1]"? I cannot figure it out.

thanks!

y-ich commented 5 years ago

@likefanfan520 san,

  1. (I am sorry, but I could not understand your point.) The parent visit count is one after the node itself is evaluated. This means the parent visit count is non-zero when calculating the UCB of the children, even the first time.

  2. I was also surprised when I learned it^^. A person at DeepMind wrote it. See http://talkchess.com/forum3/viewtopic.php?f=2&t=69175&start=70&sid=8eb37b9c943011e51c0c3a88b427b745 (a tiny example of the value-scale mapping follows below).
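
A tiny hypothetical helper to make the scale conversion explicit (the function name is my own, not from the pseudocode):

def net_value_to_mcts(v_net: float) -> float:
    # the network outputs v in [-1, 1]; the MCTS Q is kept in [0, 1],
    # so a loss corresponds to 0 rather than -1
    return (v_net + 1.0) / 2.0

assert net_value_to_mcts(-1.0) == 0.0  # loss
assert net_value_to_mcts(0.0) == 0.5   # even
assert net_value_to_mcts(1.0) == 1.0   # win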

likefanfan520 commented 5 years ago
I was also surprised when I learned it^^. A person at DeepMind wrote it. See http://talkchess.com/forum3/viewtopic.php?f=2&t=69175&start=70&sid=8eb37b9c943011e51c0c3a88b427b745

Thanks a lot for your information.

But I cannot understand why Q is initialized to the loss value. If all edges are initialized to the loss value, the child that is selected first would have a larger Q than the others. Why did they design it this way? Still, I think that after all children have been evaluated, the problem (how to initialize the Q value) doesn't matter at all, because Q is then updated to the average of W.

y-ich commented 5 years ago

@likefanfan520 san,

He explained it in the thread: "Assumption being that most positions have 1 or at most a few good moves. All other moves are akin to passing or worse. In most equal-ish positions, passing will give the opponent a big advantage."

Ishinoshita commented 5 years ago

@likefanfan520 The 'init to loss' scheme (the preferred wording, to avoid ambiguity) amounts to saying that the child preferred by the policy is strongly favored at the beginning of the search, at any sub-point in the tree. It amounts to saying that all the other children are bad (loss value) until the UCT mechanics give them a chance to prove this assumption wrong (by receiving a first visit). It makes the search much narrower / more policy-driven in the early stages, which was shown to increase playing strength, as reproduced in Minigo's last runs (introduced halfway through run 14, applied for runs 15 to 17).

This being said, at the root, the high level of batching forces many children to be evaluated whatever their policy, which amounts to assuming roughly 'init to win' in the early phase of tree growth.
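
A rough sketch of the two first-play-urgency schemes being contrasted (illustrative Python, not Minigo or Leela Zero code):

def first_play_value(visits, value_sum, scheme="init_to_loss"):
    # value an unvisited child contributes to its Q + U score, on a [0, 1] scale
    if visits > 0:
        return value_sum / visits          # real average once visited
    if scheme == "init_to_loss":
        return 0.0   # assume the move loses until a visit proves otherwise
    if scheme == "init_to_even":
        return 0.5   # assume the move is as good as an even position
    raise ValueError(scheme)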

likefanfan520 commented 5 years ago

I got it. Thank you very much!

likefanfan520 commented 5 years ago
  1. In AZ, the argument of sqrt is not the sum of N; it is the parent visit count. The parent visit count is equal to (sum of N) + 1, because the count is also incremented when the node itself is evaluated. So the first time a search thread wants to select a child, the parent visit count is one, not zero. Thus the policy priors in the second term are effective even on the first selection.

But why, in the Leela Zero code (version 0.17), is the numerator this:

const auto numerator = std::sqrt(double(parentvisits) *
    std::log(cfg_logpuct * double(parentvisits) + cfg_logconst));

where "parentvisits" is sum of child visit counts:

auto parentvisits = size_t{0};
for (const auto& child : m_children) {
    if (child.valid()) {
        parentvisits += child.get_visits();
        if (child.get_visits() > 0) {
            total_visited_policy += child.get_policy();
        }
    }
}
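
For comparison, here is my own transcription of that numerator in Python next to the plain sqrt(parent visits) used in the AZ formula (cfg_logpuct and cfg_logconst stand for the engine's tunable constants; the values below are only placeholders):

import math

cfg_logpuct = 0.015   # placeholder value
cfg_logconst = 1.7    # placeholder value

def lz_numerator(parentvisits):
    # Leela Zero 0.17 style: the exploration term grows with a log factor
    return math.sqrt(parentvisits *
                     math.log(cfg_logpuct * parentvisits + cfg_logconst))

def az_numerator(parent_visits):
    # the plain square root of the parent visit count, as in the AZ formula
    return math.sqrt(parent_visits)

print(lz_numerator(100), az_numerator(100))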

y-ich commented 5 years ago

@likefanfan520 san,

One of the Leela Zero developers may be able to answer your question.

What I know, as just a Leela Zero user, is that Leela Zero was developed before the AZ paper was published.

alreadydone commented 5 years ago

See https://github.com/leela-zero/leela-zero/pull/2072. (git blame is your friend, and can be activated by hitting the B key in Github code view.)

likefanfan520 commented 5 years ago
I was also surprised when I learned it^^. A person at DeepMind wrote it. See http://talkchess.com/forum3/viewtopic.php?f=2&t=69175&start=70&sid=8eb37b9c943011e51c0c3a88b427b745

Do you know of any other DeepMind member disclosing more details about AZ or AGZ somewhere? I benefited a great deal from reading his responses.

y-ich commented 5 years ago

@likefanfan520 san,

That was the only one I know of.

likefanfan520 commented 5 years ago

Why doesn't "virtual loss" exist in pseudocode? Do you know how AZ sets this value?

y-ich commented 5 years ago

@likefanfan520 san,

You would have to ask DeepMind^^ Parallelism does not seem essential for the AZ paper. I have experience implementing AZ for playing (not for training), but I could not notice any difference between 1 and 3 for the virtual loss. You may need many samples to observe a difference.

likefanfan520 commented 5 years ago

@y-ich @Ishinoshita

I still have three questions about this topic:

  1. When using the "init to loss" scheme, how does "virtual loss" work? It seems "virtual loss" has no punishing effect in this case, because the virtual loss is equal to the initialized value.

  2. In the 'init to loss' scheme, most of the search threads would be very likely to select the same child, the one with the highest P-value, because its value would also most probably be higher than the other children's (the other children's values are initialized to the loss value, which is the minimum value). It seems the only ways left for the other children to be selected are the Dirichlet noise and the sqrt(N) / (1 + N) term in the UCB formula.

  3. Does AZ adopt "virtual loss" or "parallel searching" in MCTS? Or did AZ give up the "parallel searching" that exists in the AGZ paper? I did not find any "virtual loss" or multi-threading code in the AZ pseudocode (a rough sketch of the virtual-loss bookkeeping I have in mind follows the pseudocode below):

def run_mcts(config: AlphaZeroConfig, game: Game, network: Network):
  root = Node(0)
  evaluate(root, game, network)
  add_exploration_noise(config, root)

  for _ in range(config.num_simulations):
    node = root
    scratch_game = game.clone()
    search_path = [node]

    while node.expanded():
      action, node = select_child(config, node)
      scratch_game.apply(action)
      search_path.append(node)

    value = evaluate(node, scratch_game, network)
    backpropagate(search_path, value, scratch_game.to_play())
  return select_action(config, game, root), root
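
For context, this is the kind of virtual-loss bookkeeping I mean (a hypothetical sketch, not taken from the pseudocode; the names and the value of VIRTUAL_LOSS are my own):

class Node:
    def __init__(self):
        self.visit_count = 0
        self.value_sum = 0.0   # accumulated on a [0, 1] scale, where 0 == loss

VIRTUAL_LOSS = 3  # pretend this many lost playouts while an evaluation is in flight

def add_virtual_loss(search_path):
    # called when a thread descends through search_path, before the network
    # evaluation returns: the extra visits carry no value, so they look like losses
    for node in search_path:
        node.visit_count += VIRTUAL_LOSS

def revert_virtual_loss(search_path):
    # called after the real value has been backed up along the path
    for node in search_path:
        node.visit_count -= VIRTUAL_LOSS
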
y-ich commented 5 years ago

@likefanfan520 san,

  1. "Init to loss" scheme affects only UCBs of not-evaluated children. "virtual loss" affects only UCBs of children which are being evaluated by other threads. They are different schemes.

  2. You can know how "Init to loss" scheme is better than "init to even" scheme when you think the case that a current situation is bad. In the case of "init to even" scheme, all children will be selected soon since all values of children will be bad. This behavior wastes many playouts. In the case of "init to loss" scheme, even if current situation is bad, higher P nodes will be selected.

  3. Parallelism is essential to maximize TPU power. They should have used it and virtual loss. But parallelism is not essential for the paper, so they omitted it in the pseudo code.
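
To separate the two mechanisms, here is a minimal sketch of the Q used at selection time (my own illustration, not Leela Zero or AZ code):

from dataclasses import dataclass

@dataclass
class Child:
    visit_count: int = 0
    value_sum: float = 0.0   # accumulated on a [0, 1] scale
    virtual_losses: int = 0  # in-flight evaluations started by other threads

def selection_q(child: Child) -> float:
    effective_visits = child.visit_count + child.virtual_losses
    if effective_visits == 0:
        return 0.0           # "init to loss": the child has never been touched
    # virtual losses add visits but no value, so they pull Q toward 0 (loss)
    # even for a child that already has real evaluations behind it
    return child.value_sum / effective_visits

print(selection_q(Child(visit_count=10, value_sum=6.0, virtual_losses=0)))  # 0.6
print(selection_q(Child(visit_count=10, value_sum=6.0, virtual_losses=3)))  # ~0.46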

likefanfan520 commented 5 years ago

@y-ich

  1. Thanks for your answers. I know they are different schemes. What I mean is: if the program uses both "init to loss" and "virtual loss", then "virtual loss" seems to have no effect. The purpose of "virtual loss" is to reduce the Q-value while a search thread has not yet received the backed-up value from the leaf node, so that other search threads do not select the same path simultaneously. But if you also use "init to loss", the Q-values in the UCBs of the nodes on the other paths are the identical loss value (equal to -1), so this does not make the nodes on different paths any easier to select. "Init to loss" seems to offset the effect of "virtual loss".

  2. Yes, you are right in some cases, but not always. For example, if the current situation is even, then "init to loss" would make it difficult to find the correct move. In this situation, the "init to loss" search is too stubborn to switch away from a bad action (especially at the beginning of training, when the moves favored by a high P-value are essentially picked at random, because the neural network weights are random).

y-ich commented 5 years ago

@likefanfan520 san,

  1. You can imagine an extreme case: after you have evaluated all children at least once, the "init to loss" scheme has no effect, but "virtual loss" keeps its effect.
  2. I have no idea how it behaves during training...
likefanfan520 commented 5 years ago

@y-ich Yes, I think you are right. Thanks!