Thanks for sharing the minimal example. I can clear up one confusion: the action passed to chance_recurrent_fn(params, key, action, afterstate) is actually the chance outcome. To give different actions different rewards, modify the decision_recurrent_fn to output a different afterstate for each action.
You can take inspiration from the bandit in the tests: https://github.com/deepmind/mctx/blob/bfb7316b96f9e5b04744e8872c1abba9b2dac6b9/mctx/_src/tests/policies_test.py#L42
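For example, a decision_recurrent_fn along these lines stores the chosen action in the afterstate embedding so that chance_recurrent_fn can read it back. This is only a sketch: the embedding layout, shapes, and the num_chance_outcomes constant are illustrative choices, with the output fields following mctx's DecisionRecurrentFnOutput (chance_logits, afterstate_value).

```python
import jax.numpy as jnp
import mctx

def decision_recurrent_fn(params, rng_key, action, embedding):
  # Encode the chosen action into the afterstate embedding, so that
  # chance_recurrent_fn can later look it up and pay an action-dependent
  # reward. Shapes: action is [B], the afterstate embedding here is [B, 1].
  batch_size = action.shape[0]
  num_chance_outcomes = 2  # illustrative
  afterstate_embedding = action.astype(jnp.float32)[:, None]
  output = mctx.DecisionRecurrentFnOutput(
      chance_logits=jnp.zeros([batch_size, num_chance_outcomes]),
      afterstate_value=jnp.zeros([batch_size]))
  return output, afterstate_embedding
```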
I will improve the documentation for chance_recurrent_fn. Sorry for the confusion.
@fidlej Thanks for your reply. Perhaps the argument can be renamed to outcome, for clarity?
@fidlej Any idea about the children_rewards issue?
You can see that the output.search_tree contains only the actions relevant for the decision nodes. The masking is done here: https://github.com/deepmind/mctx/blob/bfb7316b96f9e5b04744e8872c1abba9b2dac6b9/mctx/_src/policies.py#L366
The zeros in the children_rewards then make sense. The reward is zero for the children of the decision nodes.
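To see this concretely, given a policy_output returned by mctx.stochastic_muzero_policy, one can inspect the returned tree. A sketch, assuming the children_rewards and action_weights fields of mctx's Tree and PolicyOutput:

```python
tree = policy_output.search_tree
# The returned tree exposes only the decision-node actions, so the rewards
# produced by chance_recurrent_fn (attached to chance transitions) are
# masked out of this view:
print(tree.children_rewards)         # all zeros: decision-node children carry no reward
print(policy_output.action_weights)  # the final weights over decision actions
```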
I'm having issues with mctx.stochastic_muzero_policy. Here's an example:

The first issue is that the children_rewards are all 0, despite the fact that chance_recurrent_fn always yields a positive reward. The second issue is that the final weight of the zeroth action (which receives an additional reward of 100) is not higher than the rest, despite a large number of simulations.
Any idea what might be causing these issues?
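For reference, here is a minimal sketch of the kind of setup described above. The original snippet is not reproduced here; the shapes, constants, and the trick of encoding the action in the afterstate embedding are illustrative assumptions, not the original code:

```python
import jax
import jax.numpy as jnp
import mctx

batch_size = 1
num_actions = 4
num_chance_outcomes = 2

def decision_recurrent_fn(params, rng_key, action, embedding):
  # Remember the chosen action in the afterstate embedding, shape [B, 1].
  afterstate_embedding = action.astype(jnp.float32)[:, None]
  output = mctx.DecisionRecurrentFnOutput(
      chance_logits=jnp.zeros([batch_size, num_chance_outcomes]),
      afterstate_value=jnp.zeros([batch_size]))
  return output, afterstate_embedding

def chance_recurrent_fn(params, rng_key, chance_outcome, afterstate_embedding):
  # Always-positive reward, with a +100 bonus when the stored action is 0.
  action = afterstate_embedding[:, 0]
  reward = jnp.where(action == 0.0, 101.0, 1.0)
  output = mctx.ChanceRecurrentFnOutput(
      action_logits=jnp.zeros([batch_size, num_actions]),
      value=jnp.zeros([batch_size]),
      reward=reward,
      discount=jnp.ones([batch_size]))
  return output, afterstate_embedding

root = mctx.RootFnOutput(
    prior_logits=jnp.zeros([batch_size, num_actions]),
    value=jnp.zeros([batch_size]),
    embedding=jnp.zeros([batch_size, 1]))

policy_output = mctx.stochastic_muzero_policy(
    params=(),
    rng_key=jax.random.PRNGKey(0),
    root=root,
    decision_recurrent_fn=decision_recurrent_fn,
    chance_recurrent_fn=chance_recurrent_fn,
    num_simulations=256,
    num_actions=num_actions,
    num_chance_outcomes=num_chance_outcomes)

print(policy_output.action_weights)
```

With the action encoded in the afterstate, chance_recurrent_fn can pay the +100 bonus for action 0, and with enough simulations action 0's weight should dominate; without that encoding, every action leads to the same afterstate and therefore the same reward.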