sampling action from tree-search policy

drasros commented 1 year ago

Thanks for releasing this library!!

In the Gumbel MuZero paper (appendix F), I read:

Does 'benefits from the same exploration' mean 'benefits from exploration with same sampling proportionally to counts (also with temperature I assume?)' or 'also benefits from exploration, as implemented with Gumbel noise on root node' ?

In other words, with policy_output = gumbel_muzero_policy(...), policy_output.action already contains a selected action, 'sampled' here with the same gumbel that was used for building the tree. Is this selected action supposed to be used for building training trajectories or should I rather sample action proportionally to policy_output.search_tree.summary().visit_counts as Appendix F and Figure 8 seem to suggest ?

fidlej commented 1 year ago

Thanks for asking. To reproduce the exploration on Go, during training, you should sample proportionally to the visit_counts, in the first 30 moves. This is a heuristic to ensure that the policy is trained in many states. You can find better ways in the future.

drasros commented 1 year ago

thank you for your answer.

fidlej commented 1 year ago

To avoid biasing the value network, you can first select randomly a state from the first 30 moves and then start training from that state. After starting from the state, the acting can use the policy_output.action.

drasros commented 1 year ago

Oh, I see, thanks! So, to summarize, if I understand you correctly (and if any future reader):

From a tree search result we have A_{t+1}=policy_output.action , and counts.

during exploratory self play (b) we sample a{t+1} from counts (for first 30 moves then greedy with counts) but also store A{t+1} along with trajectory
during training (c) we use A{t+1} for unrolling the network (and not a{t+1}).

drasros commented 1 year ago

which then begs the following question: for non-exploratory self-play (=at evaluation), do you use a{t+1} (greedy w.r.t counts) or A{t+1} ?

fidlej commented 1 year ago

I would correct the suggestion under the diagram:

During exploratory self play (b) we sample a_{t+1} from counts for first num_exploration_moves, where num_explorations_moves is sampled uniformly from {0, 1, ..., 30}.
I would not train in the first num_explorations_moves. (This is a simple way to avoid a bias, you can find other tricks.)
After the exploratory moves, training would act with policy_output.action. The model is always trained with the actions used in the environment.

For evaluation, use policy_output.action.

drasros commented 1 year ago

Thank you, now it is all clear, and indeed it is not what I first understood. (maybe it could be made more clear in the paper?). Have a great day.

carlosgmartin commented 1 year ago

@fidlej I have a question regarding the gumbel_scale argument to gumbel_muzero_policy. Is the following correct?

gumbel_scale = 1 to sample actions during training
gumbel_scale = 1 to create the targets for the policy loss during training
gumbel_scale = 0 to evaluate

fidlej commented 1 year ago

You can use gumbel_scale=1 also for evaluation. On partially observable games or with an imperfect memory, the stochastic policy is helpful.

For example, imagine that your car does not display the amount of fuel left. A deterministic policy would stop at every gas station. A stochastic policy can instead stop at a gas station with a probability.

p3achyjr commented 1 year ago

qq about this--if we sample according to visit counts, wouldn't the target have a fixed shape? i.e. for 16 visits, 4 actions, we would have [2, 2, 6, 6], and for 24, 8, we would have [1, 1, 1, 1, 2, 2, 4, 4].

fidlej commented 1 year ago

@p3achyjr What do you mean by the shape of the target?

p3achyjr commented 1 year ago

In PUCT, iiuc the visit distribution can take on any shape (I.e the visits can be distributed uniformly, or with a single action receiving all visits, or anywhere in between). In Sequential Halving, isn't the visit distribution for a given n, m fixed? Since we assign n/mlog(m) visits to each action in the first round, n/(m/2)log(m) in the second round, etc.

fidlej commented 1 year ago

Thanks for the clarification. Yes, the shape of the distribution is fixed. I would not use the visit counts as a target. The visit counts are used to produce an explorative move.

google-deepmind / mctx

sampling action from tree-search policy #38