kblomdahl / dream-go

Artificial go player based on reinforcement and supervised learning
Apache License 2.0

Monte-Carlo tree search as regularized policy optimization #47

Open kblomdahl opened 4 years ago

kblomdahl commented 4 years ago

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero’s search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.

http://proceedings.mlr.press/v119/grill20a/grill20a.pdf http://proceedings.mlr.press/v119/grill20a/grill20a-supp.pdf
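
For anyone finding this later, my reading of the key result is that the standard PUCT selection rule approximately solves

$$
\bar{\pi} \;=\; \underset{y \in \Delta}{\arg\max}\Big[\, q^{\top} y \;-\; \lambda_N \,\mathrm{KL}\!\left(\pi_\theta \,\|\, y\right) \Big],
\qquad
\lambda_N \;=\; c \cdot \frac{\sqrt{\sum_b n_b}}{|A| + \sum_b n_b},
$$

and that the exact solution has the form $\bar{\pi}(a) \propto \pi_\theta(a) / (\alpha - q(a))$, where $\alpha$ is the unique constant that normalises $\bar{\pi}$ into a probability distribution. The paper proposes using this $\bar{\pi}$ directly in place of the visit-count heuristics.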

kblomdahl commented 3 years ago

The initial implementation seems to work, but without a scheme such as virtual visits it does not allow for enough parallelism to reach acceptable performance.

0xJchen commented 2 years ago

Hi Chicoryn, I am also interested in this paper. I wonder what virtual visits means here? I guess the original dichotomy search scheme scales well with the number of legal actions (complexity of ~log(n)), so adding it to the action selection stage shouldn't cost too much.
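
For concreteness, by the dichotomy I mean a binary search for the normalising constant α in the closed-form solution π̄(a) ∝ π_θ(a) / (α − q(a)). A minimal sketch, with made-up names, assuming strictly positive priors and bounded q values (not the dream-go implementation):

```rust
/// Exact solution of `argmax_y q·y − lambda·KL(prior, y)` over the simplex.
/// Returns `pi_bar(a) = lambda * prior[a] / (alpha − q[a])`, where `alpha` is
/// found by a dichotomy (binary search) so that the result sums to one.
fn pi_bar(q: &[f32], prior: &[f32], lambda: f32) -> Vec<f32> {
    // Bracketing interval for alpha; assumes all priors are strictly positive
    // so every denominator stays positive inside the interval.
    let q_max = q.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut lo = q.iter().zip(prior)
        .map(|(&q_a, &p_a)| q_a + lambda * p_a)
        .fold(f32::NEG_INFINITY, f32::max);
    let mut hi = q_max + lambda;

    // dichotomy on alpha: the total mass is strictly decreasing in alpha
    for _ in 0..32 {
        let alpha = 0.5 * (lo + hi);
        let total: f32 = q.iter().zip(prior)
            .map(|(&q_a, &p_a)| lambda * p_a / (alpha - q_a))
            .sum();

        if total > 1.0 { lo = alpha; } else { hi = alpha; }
    }

    let alpha = 0.5 * (lo + hi);
    q.iter().zip(prior)
        .map(|(&q_a, &p_a)| lambda * p_a / (alpha - q_a))
        .collect()
}
```

Each iteration halves the interval, so the cost per selection is O(|A| · log(1/ε)), which matches the ~log(n) intuition above.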

kblomdahl commented 2 years ago

@PeppaCat Hello, this was some time ago so I must admit I've forgotten a lot of the implementation details of my experiments. But virtual visits normally refers to a penalty [1] added to leaf nodes that are currently being expanded, to prevent the next probe into the search tree from trying to expand the same node again. This is very important in real play scenarios where we are not restricted by the number of probes into the search tree, but rather by a real-time limit, so it's better to fill each batch fully instead of running a half-full batch through the value function.

As for your actual question, I am not sure about the answer since, as you indicate, it seems like it should be fairly straightforward to introduce virtual losses into equation (3). I suspect that I did not have sufficient time / energy to investigate this further.
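
To make the virtual visits part concrete, here is a rough sketch with made-up field names of the bookkeeping I have in mind (the actual tree lives in [1] below). Pending probes are counted as visits that returned a loss, which lowers q(a) until the real value is backed up:

```rust
// Hypothetical per-child statistics; the real structure in dream-go differs.
struct Child {
    visits: u32,         // completed visits n(a)
    virtual_visits: u32, // probes currently in flight through this child
    total_value: f32,    // sum of backed-up values, assumed to lie in [0, 1]
    prior: f32,          // pi_theta(a)
}

impl Child {
    /// Effective statistics while a batch is being filled: pending probes count
    /// as visits that returned the worst possible value (0.0 here), which lowers
    /// q(a) and steers the next probe towards a different subtree.
    fn effective(&self) -> (u32, f32) {
        let n = self.visits + self.virtual_visits;
        let q = if n == 0 { 0.0 } else { self.total_value / n as f32 };
        (n, q)
    }
}

// On the way down: virtual_visits += 1 for every node on the selected path.
// On backup:       virtual_visits -= 1; visits += 1; total_value += value.
```

Feeding these effective (n, q) pairs into the λ_N and q terms of the exact solution is, as far as I can tell, the straightforward way to combine virtual losses with equation (3), but I never verified it experimentally.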

[1] https://github.com/Chicoryn/dream-go/blob/master/src/libdg_mcts/tree.rs#L1368