google-deepmind / mctx

Monte Carlo tree search in JAX
Apache License 2.0

Question about the improved policy in Gumbel MuZero #50

Closed karroyan closed 1 year ago

karroyan commented 1 year ago

Hi, thanks for open sourcing the great library!

I'm using it to experiment with MCTS on a project, and I have a question regarding the function σ used in constructing the improved policy: π' = softmax(logits + σ(completedQ)).

I noticed that the scale of this function determines the relative weight between the logits and completedQ, which in turn affects child selection in MCTS. In my experiments with tic-tac-toe, I found that a larger weight on completedQ can lead to a higher reward in MCTS.

The paper only states that σ is a monotonically increasing function, but I was wondering if there are any other constraints on, or discussion of, the form of σ?
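For concreteness, the construction can be sketched in a few lines of NumPy. This is an illustration only, not the mctx implementation; the linear form sigma(q) = scale * q and the example numbers are my assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def improved_policy(logits, completed_q, scale=1.0):
    """pi' = softmax(logits + sigma(completedQ)), with an assumed linear sigma(q) = scale * q."""
    return softmax(logits + scale * completed_q)

logits = np.array([1.0, 0.5, 0.0])       # prior policy favours action 0
completed_q = np.array([0.0, 1.0, 0.2])  # completed Q-values favour action 1

# A small scale keeps the improved policy close to the prior logits;
# a large scale shifts probability mass toward the high-Q action.
prior_like = improved_policy(logits, completed_q, scale=0.1)
q_heavy = improved_policy(logits, completed_q, scale=10.0)
```

With scale=0.1 the argmax follows the logits (action 0); with scale=10.0 it follows the Q-values (action 1), which matches the trade-off described above.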

Thank you!

fidlej commented 1 year ago

Thanks for the interesting question. In the code, we use qtransform to transform the Q-values.

If the Q-values are perfectly estimated and exactly one action has the highest Q-value, then you can simply act with that best action. On the other hand, the following situations may benefit from a combination of the logits and the Q-values:

a) The Q-values are approximate.
b) The agent has a limited representation of the state.
c) The environment is a zero-sum imperfect-information game (e.g., poker, StarCraft).

In b) and c), the best policy may need to be stochastic. There, multiple actions would have the same highest Q-value.

In tic-tac-toe, the state of the game is perfectly represented by the board position. So a higher scale for the Q-values seems like a good idea there.

karroyan commented 1 year ago

Thank you for your answer! I appreciate your explanation of the experiment results, which seems reasonable to me. However, I still have a question regarding the code.

I noticed that the sigma function in the code is a linear transformation, specifically (maxvisit_init + visit_count) * value_scale, where maxvisit_init is set to 50 and value_scale is set to 0.1. I'm curious how to set these two hyperparameters in a way that would yield good performance in different environments, such as Gomoku. Do you have any advice or best practices? Thank you!

fidlej commented 1 year ago

The default values for gumbel_muzero worked well on Go, chess, and Atari. If you use a small number of simulations, you may try increasing the value_scale. The linear transformation makes sense if the Q-values are stochastic; the mean is then correctly estimated by averaging.
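To see how the two hyperparameters interact, here is a pure-NumPy reconstruction of the linear sigma discussed above (my own sketch, not the mctx source; visit_count stands in for the relevant visit count at the node):

```python
import numpy as np

def sigma(q, visit_count, maxvisit_init=50.0, value_scale=0.1):
    """Assumed linear transform: sigma(q) = (maxvisit_init + visit_count) * value_scale * q."""
    return (maxvisit_init + visit_count) * value_scale * q

def improved_policy(logits, completed_q, visit_count, **kw):
    x = logits + sigma(completed_q, visit_count, **kw)
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([6.0, 0.0])       # prior strongly favours action 0
completed_q = np.array([0.0, 1.0])  # Q-values favour action 1

# With few simulations (small visit_count) and the default value_scale,
# the prior logits still dominate; raising value_scale hands control to the Q-values.
few_sims_default = improved_policy(logits, completed_q, visit_count=4)
few_sims_scaled = improved_policy(logits, completed_q, visit_count=4, value_scale=1.0)
```

With visit_count=4, the default gives sigma weight (50 + 4) * 0.1 = 5.4, which is not enough to overturn the logit gap of 6.0, so the argmax stays on action 0; with value_scale=1.0 the weight grows to 54 and the argmax flips to action 1. This illustrates why, at low simulation counts, increasing value_scale puts more weight on the Q-values.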