Closed Carbon225 closed 1 year ago
Thanks for sharing your connect4 code. It is very nicely documented.
And thanks for asking for the clarification. The rewards along a path are summed together to estimate the value of a parent node, i.e. `parent_return = reward + discount * child_return`: https://github.com/deepmind/mctx/blob/aa55375dff40b4ae128680bf6ba2d0874e54fbc3/mctx/_src/search.py#LL270C1-L270C1
To implement an absorbing state, several approaches achieve the same effect. For example, if not using `discount=0`, the environment would need to remain in a state with `reward=0` and `value=end_game_value`. Your usage of reward, value, and discount makes sense.
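The backup rule above can be sketched in plain Python (a minimal illustration, not mctx's actual implementation; the function name and the list-based path representation are my own):

```python
def backup(rewards, discounts, leaf_value):
    """Propagate a leaf value back to the root of a search path.

    rewards[i] and discounts[i] describe the transition into node i+1 on
    the path from the root; returns the estimated return at the root.
    """
    value = leaf_value
    for reward, discount in zip(reversed(rewards), reversed(discounts)):
        # The rule quoted above: parent_return = reward + discount * child_return
        value = reward + discount * value
    return value

# A path whose second transition is terminal (discount=0): everything
# below the terminal node is cut off from the returns above it, so the
# leaf value of 123.0 has no effect on the root estimate.
print(backup([1.0, 0.0], [1.0, 0.0], 123.0))  # prints 1.0
```

This illustrates why setting `discount=0` at the terminal transition is enough to make the state absorbing: whatever value the subtree below reports, it is multiplied by zero before reaching any ancestor.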
Great! I see now.
Thank you so much, I actually wrote a lot of questions in this text box but finally I understand everything 👍
Hi, I really appreciate your library. I will be using it for my thesis project and need to understand how it works.
For this purpose, I used it to implement classic MCTS with random rollouts in a Jupyter notebook: https://github.com/Carbon225/mctx-classic I want it to be as informative as possible and to explain why every line is the way it is, so my teammates can understand it as well. Feel free to add this example to your readme, or ignore it.
Below I will describe the last aspect I don't feel like I understand.
Consider this definition of the `recurrent_fn`:

I have read that the terminal node is considered absorbing. From my understanding, this means that below this node all rewards and values should be 0. This should be guaranteed by setting the discount to 0 in the `RecurrentFnOutput` of the terminal node. In other examples of `mctx` I have seen people setting the value to 0 at the terminal node as well. Which is correct? When should the reward/value/discount be set to 0?

I also believe the `reward` field is never actually used by the search. It's only used for training outside the `mctx` library, correct?
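The terminal handling asked about above can be sketched as follows. This is a hypothetical example, not code from the linked notebook: the signature is simplified from mctx's actual `recurrent_fn(params, rng_key, action, embedding)`, the `RecurrentFnOutput` class is a plain stand-in with the same field names as mctx's, and `env_step`/`value_fn` are assumed helpers.

```python
from typing import NamedTuple, Callable, List

class RecurrentFnOutput(NamedTuple):
    # Stand-in for mctx.RecurrentFnOutput; field names match the real one.
    reward: float
    discount: float
    prior_logits: List[float]
    value: float

def recurrent_fn_sketch(env_step: Callable, value_fn: Callable,
                        action: int, embedding):
    """Hypothetical recurrent_fn showing only the terminal-state handling."""
    new_embedding, reward, terminal = env_step(embedding, action)
    if terminal:
        # Absorbing state: discount=0 cuts the subtree out of every
        # backed-up return, so the value reported below the terminal
        # node never matters. Setting value=0 too is harmless but
        # redundant once discount is 0.
        out = RecurrentFnOutput(reward=reward, discount=0.0,
                                prior_logits=[0.0], value=0.0)
    else:
        out = RecurrentFnOutput(reward=reward, discount=1.0,
                                prior_logits=[0.0],
                                value=value_fn(new_embedding))
    return out, new_embedding
```

Under these assumptions, the terminal reward (e.g. the win/loss signal) still goes into `reward`, which the search does use when backing up returns; only the `discount` needs to be zero.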