google-deepmind / mctx

Monte Carlo tree search in JAX
Apache License 2.0

I would like to ask a question and hope you can help me answer it. #80

Closed. Nightbringers closed this 9 months ago.

Nightbringers commented 9 months ago

In the MuZero paper, there is a transform applied to the value and reward. This is how the paper describes it: For value and reward prediction in Atari we follow [30] in scaling targets using an invertible transform h(x) = sign(x)(√(|x| + 1) − 1 + εx), where ε = 0.001 in all our experiments. We then apply a transformation φ to the scalar reward and value targets in order to obtain equivalent categorical representations. We use a discrete support set of size 601 with one support for every integer between −300 and 300. Under this transformation, each scalar is represented as the linear combination of its two adjacent supports, such that the original value can be recovered by x = x_low * p_low + x_high * p_high. As an example, a target of 3.7 would be represented as a weight of 0.3 on the support for 3 and a weight of 0.7 on the support for 4. The value and reward outputs of the network are also modeled using a softmax output of size 601. During inference the actual value and rewards are obtained by first computing their expected value under their respective softmax distribution and subsequently by inverting the scaling transformation. Scaling and transformation of the value and reward happens transparently on the network side and is not visible to the rest of the algorithm.
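For concreteness, here is a minimal JAX sketch of the scaling transform and the two-hot encoding described in that passage. The function names `h`, `h_inv`, `two_hot`, and `expectation` are illustrative, not part of mctx; the closed-form inverse follows the formula in the MuZero pseudocode.

```python
import jax.numpy as jnp

EPS = 0.001                      # ε from the paper
SUPPORT = jnp.arange(-300, 301)  # 601 atoms, one per integer in [-300, 300]

def h(x):
    # Invertible scaling transform: h(x) = sign(x)(sqrt(|x| + 1) - 1) + εx.
    return jnp.sign(x) * (jnp.sqrt(jnp.abs(x) + 1.0) - 1.0) + EPS * x

def h_inv(y):
    # Closed-form inverse of h, as given in the MuZero pseudocode.
    return jnp.sign(y) * (
        ((jnp.sqrt(1.0 + 4.0 * EPS * (jnp.abs(y) + 1.0 + EPS)) - 1.0)
         / (2.0 * EPS)) ** 2 - 1.0)

def two_hot(x):
    # φ: spread a scalar over its two adjacent integer supports.
    x = jnp.clip(x, -300.0, 300.0)
    low = jnp.floor(x).astype(jnp.int32)  # e.g. 3 for x = 3.7
    p_high = x - low                      # 0.7 on the support for 4
    p_low = 1.0 - p_high                  # 0.3 on the support for 3
    probs = jnp.zeros(SUPPORT.shape[0])
    probs = probs.at[low + 300].add(p_low)
    probs = probs.at[jnp.clip(low + 301, 0, 600)].add(p_high)
    return probs

def expectation(probs):
    # Expected scalar under the categorical distribution, then undo the scaling.
    return h_inv(jnp.sum(probs * SUPPORT))
```

A full target encoding would be `two_hot(h(target))`, and decoding a network's softmax output is `expectation(softmax_probs)`; `two_hot(3.7)` reproduces the paper's example of weight 0.3 on the support for 3 and weight 0.7 on the support for 4.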

Question 1: That description is for Atari games. I want to know whether the same transform is still used for Go, or whether, as in AlphaZero, the value is just a single number between -1 and 1?

Here is another passage from the paper: Note that, in board games without intermediate rewards, we omit the reward prediction loss. For board games, we bootstrap directly to the end of the game, equivalent to predicting the final outcome.

Question 2: Does this mean rewards are unused in Go? Can we just set them to zero all the time?

fidlej commented 9 months ago

Thanks for the clear questions.

  1. On Go, the value is between -1 and 1.
  2. Yes. Go has no intermediate rewards, only the outcome of the game.
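For reference, a minimal sketch of what this looks like with mctx for a board game like Go (here `network_apply` is a placeholder for your own model, not part of mctx): the recurrent function returns a constant zero reward, a value in [-1, 1], and a discount of -1 so the value flips sign between the two alternating players.

```python
import jax.numpy as jnp
import mctx

def recurrent_fn(params, rng_key, action, embedding):
    # `network_apply` is a placeholder: it should return policy logits,
    # a scalar value in [-1, 1], and the next state embedding.
    prior_logits, value, next_embedding = network_apply(params, embedding, action)
    output = mctx.RecurrentFnOutput(
        reward=jnp.zeros_like(value),    # board game: no intermediate rewards
        discount=-jnp.ones_like(value),  # -1 alternates the sign between players
        prior_logits=prior_logits,
        value=value,                     # single number in [-1, 1], as in AlphaZero
    )
    return output, next_embedding
```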